COSMO - Dynamical Core Rewrite: Approach, Rewrite and Status
Tobias Gysi, Supercomputing Systems
POMPA Workshop, Manno, 3.5.2011

Supercomputing Systems AG, Technopark 1, Zürich, www.scs.ch

Approach

COSMO Dynamical Core Rewrite - Challenge
Assuming that the COSMO code will continue to run on commodity processors for the next couple of years, what performance improvement can we achieve by rewriting the dynamical core?
Boundary conditions:
– Do not touch the underlying physical model (i.e. the equations being solved): formulas must remain as they are
– Arbitrary reordering of computations, etc. is allowed
– Results must remain ‘identical’ to a very high level of accuracy
Part of an initiative looking at all parts of the COSMO code.
Support from & direct interaction with MeteoSwiss, DWD, CSCS, C2SM

Approach
[Timeline: Feasibility Study → Library Design → Rewrite → Test & Tune, spanning ~2 years across CPU and GPU tracks, with a ‘You Are Here’ marker.]

Feasibility Study

Feasibility Study - Overview
Get to know the code
Understand performance characteristics
Find computational motifs:
– Stencil
– Tri-diagonal solver (a sketch follows below)
Implement a prototype code:
– Relevant part of the dynamical core (fast wave solver, ~30% of total runtime)
– Try to optimize for x86
– No MPI parallelization
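For illustration only (the slides do not show the solver itself): implicit vertical discretizations behind this motif lead to one tridiagonal system per grid column, which the classic Thomas algorithm solves in O(n). A minimal C++ sketch, not the project's code:

   #include <cstddef>
   #include <vector>

   // Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1
   // (a[0] and c[n-1] are unused). Arguments are taken by value because
   // the sweep overwrites b and d.
   std::vector<double> thomas(std::vector<double> a, std::vector<double> b,
                              std::vector<double> c, std::vector<double> d) {
       const std::size_t n = d.size();
       for (std::size_t i = 1; i < n; ++i) {       // forward elimination
           const double m = a[i] / b[i - 1];
           b[i] -= m * c[i - 1];
           d[i] -= m * d[i - 1];
       }
       std::vector<double> x(n);
       x[n - 1] = d[n - 1] / b[n - 1];             // back substitution
       for (std::size_t i = n - 1; i-- > 0; )
           x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
       return x;
   }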

Feasibility Study - Performance Model
[Chart: performance model of the original FORTRAN code on ‘Monte Rosa’]

Feasibility Study - Prototype
Implemented in C++
Optimized for memory-bandwidth utilization:
– Avoid pre-computation; do computations on the fly
– Merge loops that access common variables
– Use iterators rather than full index calculations on the 3D grid
– Store data contiguously in the k-direction (vertical columns), as sketched below
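A minimal sketch of the k-first storage idea (class and method names invented, not the prototype's actual code):

   #include <cstddef>
   #include <vector>

   // 3D field with k varying fastest: a vertical column is one contiguous
   // run of memory, so a k-loop is a unit-stride traversal.
   class Field {
   public:
       Field(std::size_t ni, std::size_t nj, std::size_t nk)
           : nj_(nj), nk_(nk), data_(ni * nj * nk) {}

       // Full 3D index calculation: index = (i*nj + j)*nk + k.
       double& operator()(std::size_t i, std::size_t j, std::size_t k) {
           return data_[(i * nj_ + j) * nk_ + k];
       }

       // Pointer to the start of column (i,j); advancing it walks in k
       // without repeating the 3D index arithmetic, iterator-style.
       double* column(std::size_t i, std::size_t j) {
           return data_.data() + (i * nj_ + j) * nk_;
       }

   private:
       std::size_t nj_, nk_;
       std::vector<double> data_;
   };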

Fast Wave Solver - Speedup
[Chart: speedup of the C++ prototype over the original FORTRAN code]
The performance difference is NOT due to the programming language but due to code optimizations!

Feasibility Study - Conclusion
A performance increase of 2x has been achieved on a representative part of the code.
Main optimizations identified (for scalar processors):
– Avoid pre-calculation whenever possible
– Merge loops
– Change the storage order to k-first
Performance is all about memory bandwidth.

Rewrite

Design Targets
Write a code that:
– Delivers the right results
  Dedicated unit tests & verification framework
– Applies the performance optimization strategies used in the prototype
– Can be developed within a year to run on x86 and GPU platforms
  Mandatory: support three-level parallelism in a very flexible way
    Vector processing units (e.g. SSE)
    Multi-core node (sub-domain)
    Multiple nodes (domain) - not part of the SCS project
  Optional: write one single code that can be compiled for both platforms

Design Targets
Write a code that:
– Facilitates future improvements in terms of
  New models / algorithms
  Portability to new computer architectures
– Can and will be integrated by the COSMO consortium into the main branch

Stencil Library - Ideas
It is challenging to develop a stencil library:
– There is no big chunk of work that can be hidden behind an API call (as with, e.g., a matrix multiplication)
– The actual update function of the stencil is heavily application-specific and performance-critical
We use a DSEL-like approach (Domain-Specific Embedded Language):
– A “stencil language” embedded in C++
– Separate description of the loop logic and the update function
– Optimized C++ code is generated at compile time (possible thanks to C++ metaprogramming capabilities), as illustrated below
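As an illustration of this separation (all names hypothetical, not the library's actual API): the update function can be a functor handed to the loop logic as a template parameter, so the compiler inlines it into the generated loops.

   #include <cstddef>
   #include <vector>

   // Application side: update function only, no loops, no index logic.
   struct Laplace2D {
       static double Do(const std::vector<double>& in, std::size_t idx,
                        std::size_t stride_i, std::size_t stride_j) {
           return in[idx - stride_i] + in[idx + stride_i]
                + in[idx - stride_j] + in[idx + stride_j]
                - 4.0 * in[idx];
       }
   };

   // Library side: loop logic shared by all stencils; UpdateFunction::Do
   // is known at compile time and gets inlined.
   template <typename UpdateFunction>
   void apply(const std::vector<double>& in, std::vector<double>& out,
              std::size_t ni, std::size_t nj, std::size_t nk) {
       const std::size_t stride_j = nk;        // k-first storage
       const std::size_t stride_i = nj * nk;
       for (std::size_t i = 1; i + 1 < ni; ++i)
           for (std::size_t j = 1; j + 1 < nj; ++j)
               for (std::size_t k = 0; k < nk; ++k) {
                   const std::size_t idx = i * stride_i + j * stride_j + k;
                   out[idx] = UpdateFunction::Do(in, idx, stride_i, stride_j);
               }
   }

   // Usage: apply<Laplace2D>(in, out, ni, nj, nk);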

Stencil Library - Parallelization
Parallelization on the node level is done by:
– Splitting the calculation domain into blocks (IJ-plane)
– Parallelizing the work over the blocks
– Double buffering, which avoids concurrency issues
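A minimal sketch of this scheme, assuming an OpenMP backend (the slides do not name the threading mechanism; function name invented):

   #include <vector>
   // compile with -fopenmp

   // Split the IJ-plane into bi-by-bj blocks and parallelize over blocks.
   // Reading from 'in' and writing to 'out' (double buffering) means no
   // thread ever reads a value another thread writes, so no locks needed.
   void apply_blocked(const std::vector<double>& in, std::vector<double>& out,
                      int ni, int nj, int nk, int bi, int bj) {
       #pragma omp parallel for collapse(2) schedule(dynamic)
       for (int ib = 0; ib < ni; ib += bi)
           for (int jb = 0; jb < nj; jb += bj)
               for (int i = ib; i < ib + bi && i < ni; ++i)
                   for (int j = jb; j < jb + bj && j < nj; ++j)
                       for (int k = 0; k < nk; ++k) {
                           const int idx = (i * nj + j) * nk + k;
                           out[idx] = 2.0 * in[idx];  // placeholder update
                       }
   }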

Stencil Library – Loop Merging
The library allows the definition of multiple stages per stencil:
– Stages are update functions applied consecutively to one block
– As a block is typically much smaller than the complete domain, the merged stages can leverage the CPU caches (see the toy example below)

Stencil Library – Calculation On The Fly
Calculation on the fly is supported using a combination of stages and column buffers:
– Column buffers are fields the size of one block, local to every CPU core
– A first stage writes to a buffer while a second stage consumes the pre-calculated values

Stencil Code – My Toy Example
1. Naive
   for k
     a(k) := b(k) + c(k)
   end
   ...
   for k
     d(k) := a(k-1)*e(-1) + a(k)*e(0) + a(k+1)*e(+1)
   end
   ...
   for k
     f(k) := a(k)*g(k) + d(k)
   end

Stencil Code – My Toy Example
2. No pre-calculation (recompute b+c on the fly)
   for k
     d(k) := (b(k-1)+c(k-1))*e(-1) + (b(k)+c(k))*e(0) + (b(k+1)+c(k+1))*e(+1)
     f(k) := (b(k)+c(k))*g(k) + d(k)
   end

Stencil Code – My Toy Example
3. Pre-calculation with temporary variables (x and y carry b+c at levels k-1 and k)
   for k
     z := b(k+1) + c(k+1)
     d(k) := x*e(-1) + y*e(0) + z*e(+1)
     f(k) := y*g(k) + d(k)
     x := y
     y := z
   end

Stencil Code – My Toy Example
4. Pre-calculation with column buffer
   for k
     a(k) := b(k) + c(k)
   end
   for k
     d(k) := a(k-1)*e(-1) + a(k)*e(0) + a(k+1)*e(+1)
     f(k) := a(k)*g(k) + d(k)
   end

Stencil Code – My Toy Example
5. Pre-calculation with stages & column buffer
   Stencil
     Stage 1: a := b + c
     Stage 2: d := a*e  (k: -1, 0, +1)
     Stage 3: f := a*g + d
   Apply Stencil
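Spelled out in plain C++ for one column (a hypothetical helper; the library expresses the same thing declaratively with stage functors): stage 1 fills a cache-resident column buffer a, and stages 2 and 3 are merged into a single k-sweep that consumes it.

   #include <vector>

   // e = {e(-1), e(0), e(+1)}; d and f are updated at interior points only.
   void toy_stencil(const std::vector<double>& b, const std::vector<double>& c,
                    const std::vector<double>& g, const std::vector<double>& e,
                    std::vector<double>& d, std::vector<double>& f, int nk) {
       std::vector<double> a(nk);             // column buffer (block-local)
       for (int k = 0; k < nk; ++k)           // stage 1: a := b + c
           a[k] = b[k] + c[k];
       for (int k = 1; k + 1 < nk; ++k) {     // stages 2 & 3 merged
           d[k] = a[k-1]*e[0] + a[k]*e[1] + a[k+1]*e[2];
           f[k] = a[k]*g[k] + d[k];
       }
   }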

Status

Status
So far the following stencils have been implemented:
– Fast wave solver (w bottom-boundary initialization missing)
– Advection: 5th-order advection; Bott 2 advection (cri implementation missing)
– Complete tendencies
– Horizontal diffusion
– Coriolis
The next steps are:
– Implicit vertical diffusion
– Put it all together
– Performance optimization

Discussion
Acknowledgements to all our collaborators at:
– C2SM (Center for Climate Systems Modeling)
– MeteoSwiss
– DWD (Deutscher Wetterdienst)
– CSCS