PP POMPA (WG6) Overview Talk, COSMO GM11, Rome: 1st Birthday.

Presentation transcript:

PP POMPA (WG6) Overview Talk, COSMO GM11, Rome: 1st Birthday

Who is POMPA?
ARPA-EMR: Davide Cesari
C2SM/ETH: Xavier Lapillonne, Anne Roches, Carlos Osuna
CASPUR: Stefano Zampini, Piero Lanucara, Cristiano Padrin
Cray: Jeffrey Pozanovich, Roberto Ansaloni
CSCS: Matthew Cordery, Mauro Biancho, Jean-Guillaume Piccinali, William Sawyer, Neil Stringfellow, Thomas Schulthess, Ugo Varetto
DWD: Ulrich Schättler, Kristina Fröhlich
KIT: Andrew Ferrone, Hartwig Anzt
MeteoSwiss: Petra Baumann, Oliver Fuhrer, André Walser
NVIDIA: Tim Schröder, Thomas Bradley
Roshydromet: Dmitry Mikushin
SCS: Tobias Gysi, Men Muheim, David Müller, Katharina Riedinger
USAM: David Palella, Alessandro Cheloni, Pier Francesco Coppola
USI: Daniel Ruprecht

Kickoff Workshop
- May, hosted by CSCS in Manno; 15 talks, 18 participants
- Goal: get to know each other, report on work already done, plan and coordinate future activities
- Revised project plan

Task Overview
Task 1: Performance analysis and documentation
Task 2: Redesign memory layout and data structures; closely linked to work in Tasks 5 and 6
Task 3: Improve current parallelization
Task 4: Parallel I/O; focus on NetCDF (which is still written from 1 core); technical problems; new person (Carlos Osuna, C2SM) starting work
Task 5: Redesign implementation of dynamical core
Task 6: Explore GPU acceleration
Task 7: Implementation documentation; no progress

Performance Analysis
Goal:
- Understand the code from a performance perspective (workflow, data movement, bottlenecks, problems, ...)
- Guide and prioritize the work in the other tasks
- Try to ensure exchange of information and performance portability developments

Performance Analysis (Task 1)
Work:
- COSMO RAPS 5.0 benchmark with DWD, MeteoSwiss and IPCC/ETH runscripts on hpcforge.org (Ulrich Schättler, Oliver Fuhrer, Anne Roches)
- Workflow of the RK timestep (Ulrich Schättler)
- Performance analysis of COSMO RAPS 5.0 on Cray XT4, XT5 and XE6 (Jean-Guillaume Piccinali, Anne Roches)
- COSMO-ART (Oliver Fuhrer)
- Wiki page

Jean-Guillaume Piccinali and Anne Roches

Problem: Overfetching
Computational intensity is the ratio of floating point operations (ops) per memory reference (ref).
When accessing a single array value, a complete cache line (64 Bytes = 8 double precision values) is loaded into the L1 cache.

  do i = 1+nbounlines, ie-nbounlines
    A(i) = 0.0d0
  end do

... also loads the halo values A(1), A(2), A(3).
If the subdomain on a processor is very small, many values loaded from memory never get used for computation.
[Figure: array elements A(1) A(2) A(3) A(4) ... A(ie-3) A(ie-2) A(ie-1) A(ie), with the halo points at both ends fetched but never updated]
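A minimal sketch to quantify the effect (illustrative only: the halo width nbounlines = 3 and the swept subdomain widths ie are assumed values, not taken from a COSMO configuration). It counts how many of the double precision values pulled into L1 by the loop above are actually written:

  program overfetch_demo
    implicit none
    integer, parameter :: nbounlines = 3      ! assumed halo width in points
    integer, parameter :: line_len   = 8      ! 64-byte cache line holds 8 doubles
    integer :: ie, i_first, i_last, n_used, n_fetched
    real    :: used_fraction

    ! Sweep over a few assumed subdomain widths ie
    do ie = 16, 64, 16
       i_first   = 1 + nbounlines              ! first point updated by the loop
       i_last    = ie - nbounlines             ! last point updated by the loop
       n_used    = i_last - i_first + 1        ! values the loop actually writes
       ! Cache lines touched: from the line holding A(i_first) to the line holding A(i_last)
       n_fetched = ((i_last-1)/line_len - (i_first-1)/line_len + 1) * line_len
       used_fraction = real(n_used) / real(n_fetched)
       print '(a,i3,a,f5.2)', 'ie = ', ie, ': fraction of fetched values used = ', used_fraction
    end do
  end program overfetch_demo

The smaller the subdomain, the larger the fraction of fetched-but-unused halo values, which is why very small per-processor subdomains hurt computational intensity.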

Performance Analysis: Wiki

Improve Current Parallelization (Task 2)
- Loop level hybrid parallelization (OpenMP/MPI) (Matthew Cordery, Davide Cesari, Stefano Zampini); see the sketch after this list
  - No clear benefit of this approach vs. flat MPI parallelization
  - Is the approach suitable for a memory bandwidth bound code?
  - Restructuring of the code (into blocks) may help!
- Overlap communication with computation using non-blocking MPI calls (Stefano Zampini)
- Lumped halo-updates for COSMO-ART (Christoph Knote, Andrew Ferrone)
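What "loop level hybrid parallelization" means in practice, as a minimal sketch (not COSMO code: the routine, field names and loop bounds are made up): MPI still decomposes the horizontal domain into one subdomain per rank, and OpenMP threads share the loops inside each subdomain.

  ! Illustrative only: update_field, t, tendency and the bounds ie/je/ke are made-up names.
  subroutine update_field(t, tendency, dt, ie, je, ke)
    implicit none
    integer, intent(in)    :: ie, je, ke
    real(8), intent(in)    :: dt
    real(8), intent(in)    :: tendency(ie, je, ke)
    real(8), intent(inout) :: t(ie, je, ke)
    integer :: i, j, k

    ! Each MPI rank owns one (ie, je, ke) subdomain; within it, OpenMP threads
    ! split the outermost loop; this is the "loop level" hybrid approach.
    !$omp parallel do private(i, j)
    do k = 1, ke
       do j = 1, je
          do i = 1, ie
             t(i, j, k) = t(i, j, k) + dt * tendency(i, j, k)
          end do
       end do
    end do
    !$omp end parallel do
  end subroutine update_field

Since such loops are memory bandwidth bound, adding threads per rank does not automatically beat flat MPI, which matches the finding above; grouping several loops into larger blocks per thread would improve locality, hence the remark on restructuring.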

Halo exchange in COSMO (Stefano Zampini)
- 3 types of point-to-point communication: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
- Halo swapping needs completion of the East-West exchange before starting the South-North communication (implicit corner exchange)
- New version which communicates the corners explicitly (2x more messages)

New halo-exchange routine (Stefano Zampini)
OLD: compute A, then a single CALL exch_boundaries(A); the communication time is fully exposed.
NEW: the exchange is split into CALL exch_boundaries(A,2) and CALL exch_boundaries(A,3), so that sending and receiving A can overlap with computation before A is used; see the sketch below.
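The underlying pattern, as an illustrative sketch only (plain MPI non-blocking calls with a separate completion step; the routine names, neighbour ranks and buffer names are made up and this is not the actual exch_boundaries implementation):

  ! Illustrative pattern: the real routine handles full 3D halos, several fields and corners.
  subroutine halo_start(halo_send_e, halo_recv_w, n, rank_east, rank_west, comm, requests)
    use mpi
    implicit none
    integer, intent(in)  :: n, rank_east, rank_west, comm
    real(8), intent(in)  :: halo_send_e(n)
    real(8), intent(out) :: halo_recv_w(n)
    integer, intent(out) :: requests(2)
    integer :: ierr

    ! Post receive and send without blocking (the "exch_boundaries(A,2)" step)
    call MPI_Irecv(halo_recv_w, n, MPI_DOUBLE_PRECISION, rank_west, 0, comm, requests(1), ierr)
    call MPI_Isend(halo_send_e, n, MPI_DOUBLE_PRECISION, rank_east, 0, comm, requests(2), ierr)
  end subroutine halo_start

  subroutine halo_wait(requests)
    use mpi
    implicit none
    integer, intent(inout) :: requests(2)
    integer :: statuses(MPI_STATUS_SIZE, 2), ierr

    ! Complete the exchange before the halo values are used (the "exch_boundaries(A,3)" step)
    call MPI_Waitall(2, requests, statuses, ierr)
  end subroutine halo_wait

Interior computation that does not depend on the halo can run between the two calls; that window is where the communication time gets hidden.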

Early results: COSMO-2
[Figures: total time (s) for model runs; mean total time for RK dynamics]
- Is Testany / Waitany the most efficient way to ensure completion?
- Restructuring the code to find more work (B) could help!

Explore GPU Acceleration (Task 6)
Goal: investigate whether and how GPUs can be leveraged for numerical weather prediction with COSMO
Background:
- Early investigations by Michalakes et al. using WRF physical parametrizations
- Full port of the JMA next-generation model (ASUCA) to GPUs via a rewrite in CUDA
- New model developments (e.g. NIM at NOAA) which have GPUs as a target architecture in mind from the very start

GPU Motivation

Chip architecture       | Peak Performance      | Memory Bandwidth   | Power Consumption     | Price per Node
Intel Westmere ( GHz)   | 81.6 GFlops           | 32 GB/s            | 130 Watt              | X $
NVIDIA Fermi M ( GHz)   | 665 GFlops            | 155 GB/s           | 225 Watt              | X $
GPU/CPU ratio           | x 8 (compute bound)   | x 5 (memory bound) | x 1.7 ("power bound") |

Programming GPUs
- Programming languages (OpenCL, CUDA C, CUDA Fortran, ...)
  - Two codes to maintain
  - Highest control, but requires a complete rewrite
  - Highest performance (if done by an expert)
- Directive based approach (PGI, OpenMP-acc, HMPP, ...); illustrated in the sketch after this list
  - Smaller modifications to the original code
  - The resulting code is still understandable by Fortran programmers and can easily be modified
  - Possible performance sacrifice (w.r.t. a rewrite)
  - No standard for the moment
- Source-to-source translation (F2C-acc, Kernelgen, ...)
  - One source code
  - Can achieve very good performance
  - Legacy codes often don't map very well onto GPUs
  - Hard to debug
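To make the directive-based approach concrete, here is a minimal sketch using OpenACC-style directives (an assumption for illustration: the PGI accelerator and HMPP syntax actually used at the time differed, and the routine and field names here are made up). The point is that the Fortran loop itself stays untouched:

  ! Illustrative only: relax, t, t_ref and the bounds are made-up names; the directives
  ! mark the loop nest for GPU execution and manage the host-device data transfers.
  subroutine relax(t, t_ref, tau, ie, je, ke)
    implicit none
    integer, intent(in)    :: ie, je, ke
    real(8), intent(in)    :: tau
    real(8), intent(in)    :: t_ref(ie, je, ke)
    real(8), intent(inout) :: t(ie, je, ke)
    integer :: i, j, k

    !$acc data copyin(t_ref) copy(t)
    !$acc parallel loop collapse(3)
    do k = 1, ke
       do j = 1, je
          do i = 1, ie
             t(i, j, k) = t(i, j, k) - tau * (t(i, j, k) - t_ref(i, j, k))
          end do
       end do
    end do
    !$acc end parallel loop
    !$acc end data
  end subroutine relax

A CUDA C or CUDA Fortran version of the same loop would need a separate kernel plus explicit memory management, which is the "two codes to maintain" cost listed above.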

Challenges
How to change a wheel on a moving car?
- GPU hardware and programming models are rapidly changing
- Several approaches are vendor bound and/or not part of a standard
- COSMO is also rapidly evolving
How to have a single readable code which also compiles onto GPUs?
- Efficiency may require restructuring or even a change of algorithm
- Directives jungle
Efficient GPU implementation requires...
- running all of COSMO on the GPU
- enough fine grain parallelism (i.e. threads)

Explore GPU Acceleration (Task 6)
Work:
- Source-to-source translation of the whole model (Dmitry Mikushin)
- Porting of physical parametrizations using PGI directives or f2c-acc (Xavier Lapillonne, Cristiano Padrin) → next talk
- Rewrite of the dynamical core for GPUs (Oliver Fuhrer) → talk after next

HP2C OPCODE Project
- Additional proposal to the Swiss HP2C initiative to build an "OPerational COSMO DEmonstrator" (OPCODE)
- Project proposal accepted; project runs from 1 June 2011 until end of 2012
- Project lead: André Walser
- Project resources:
  - second contract with IT company SCS to continue the collaboration
  - new positions at MeteoSwiss for about 1 year
  - contribution to a position at C2SM
  - contribution from CSCS

HP2C OPCODE Project: Main Goals
- Leverage the research results of the ongoing HP2C COSMO project
- Prototype implementation of the MeteoSwiss production suite making aggressive use of GPU technology
- Similar time-to-solution on hardware with substantially lower power consumption and price: Cray XT4 (3 cabinets) → GPU based hardware (a few rack units)

Thank you!