Published by Elfrieda Potter. Modified over 9 years ago.
1
Manno, 4.5.2011, © by Supercomputing Systems
COSMO - Dynamical Core Rewrite
Approach, Rewrite and Status
Tobias Gysi
POMPA Workshop, Manno, 3.5.2011
Supercomputing Systems AG, Technopark 1, 8005 Zürich
Fon +41 43 456 16 00, Fax +41 43 456 16 10, www.scs.ch
2
Approach
3
COSMO Dynamical Core Rewrite
Challenge
Assuming that the COSMO code will continue to run on commodity processors in the next couple of years, what performance improvement can we achieve by rewriting the dynamical core?
Boundary Conditions
Do not touch the underlying physical model (i.e. the equations that are being solved)
–Formulas must remain as they are
–Arbitrary ordering of computations, etc. may change
–Results must remain ‘identical’ to ‘a very high level of accuracy’
Part of an initiative looking at all parts of the COSMO code.
Support
Support from & direct interaction with MeteoSwiss, DWD, CSCS, C2SM
4
Approach
[Figure: project timeline over ~2 years (time axis t) with the phases Feasibility Study, Library Design, Rewrite, Test and Tune, shown for CPU and GPU; a "You Are Here" marker indicates the current stage.]
5
Feasibility Study
6
Feasibility Study - Overview
Get to know the code
Understand performance characteristics
Find computational motifs
–Stencil
–Tri-Diagonal Solver
Implement a prototype code
–Relevant part of the dynamical core (Fast Wave Solver, ~30% of total runtime)
–Try to optimize for x86
–No MPI parallelization
7
Feasibility Study - Performance Model
Original FORTRAN Code on ‘Monte Rosa’
8
Feasibility Study - Prototype
Implemented in C++
Optimize for memory-bandwidth utilization
–Avoid pre-computation, do computation on the fly
–Merge loops accessing common variables
–Use iterators rather than full index calculation on the 3D grid
–Store data contiguously in ‘k-direction’ (vertical columns)
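The last two points can be illustrated with a minimal sketch. It assumes a hypothetical `Field3D` type (not taken from the prototype) that stores k as the fastest-varying index, so a vertical column is one contiguous run of memory that a plain pointer can walk without recomputing the full 3D index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3D field with k-first (column-contiguous) storage.
struct Field3D {
    std::size_t ni, nj, nk;
    std::vector<double> data;

    Field3D(std::size_t ni_, std::size_t nj_, std::size_t nk_)
        : ni(ni_), nj(nj_), nk(nk_), data(ni_ * nj_ * nk_, 0.0) {}

    // Full index calculation: k varies fastest, then j, then i.
    std::size_t index(std::size_t i, std::size_t j, std::size_t k) const {
        assert(i < ni && j < nj && k < nk);
        return (i * nj + j) * nk + k;
    }

    // Pointer to the first element of the vertical column at (i, j).
    double* column(std::size_t i, std::size_t j) {
        return data.data() + index(i, j, 0);
    }
    const double* column(std::size_t i, std::size_t j) const {
        return data.data() + index(i, j, 0);
    }
};

// Walk one column with a pointer instead of calling index() per element.
double column_sum(const Field3D& field, std::size_t i, std::size_t j) {
    double sum = 0.0;
    const double* it = field.column(i, j);
    const double* end = it + field.nk;
    for (; it != end; ++it) {
        sum += *it;
    }
    return sum;
}
```

With this layout, neighbouring k levels are adjacent in memory, so a sweep over a column touches consecutive cache lines.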
9
Fast Wave Solver - Speedup
The performance difference is NOT due to the programming language but due to code optimizations!
10
Feasibility Study - Conclusion
A performance increase of 2x has been achieved on a representative part of the code
Main optimizations identified (for scalar processors)
–Avoid pre-calculation whenever possible
–Merge loops
–Change the storage order to k-first
Performance is all about memory bandwidth
11
Rewrite
12
Design Targets
Write a code that
Delivers the right results
–Dedicated unit-tests & verification framework
Applies the performance optimization strategies used in the prototype
Can be developed within a year to run on x86 and GPU platforms
–Mandatory: support three-level parallelism in a very flexible way
  Vector processing units (e.g. SSE)
  Multi-core node (sub-domain)
  Multiple nodes (domain) - not part of the SCS project
–Optional: write one single code that can be compiled for both platforms
13
Design Targets
Write a code that
Facilitates future improvements in terms of
–New models / algorithms
–Portability to new computer architectures
Can and will be integrated by the COSMO consortium into the main branch
14
Stencil Library - Ideas
It is challenging to develop a stencil library
–There is no big chunk of work that can be hidden behind an API call (e.g. matrix multiplication)
–The actual update function of the stencil is heavily application specific and performance critical
We use a DSEL-like approach (Domain Specific Embedded Language)
–“Stencil language” embedded in C++
–Separate description of loop logic and update function
–Generate optimized C++ code at compile time (possible due to C++ meta programming capabilities)
15
Stencil Library - Parallelization
Parallelization on the node level is done by
–Splitting the calculation domain into blocks (IJ-Plane)
–Parallelizing the work over the blocks
–Double buffering avoids concurrency issues
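The scheme above can be sketched as follows. This is an illustration only and not the library's implementation: `Block`, `make_blocks`, `update_block` and `time_step` are hypothetical names, the update function is an arbitrary 4-point horizontal average, and the loop over blocks runs serially here, although it is the loop that would be distributed over the cores. Because every block reads only from the current buffer and writes only to the next one, the blocks can be processed in any order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index ranges [i0, i1) x [j0, j1) of one block in the IJ-plane.
struct Block {
    std::size_t i0, i1, j0, j1;
};

// Split an ni x nj plane into blocks of at most bi x bj points.
std::vector<Block> make_blocks(std::size_t ni, std::size_t nj,
                               std::size_t bi, std::size_t bj) {
    assert(bi > 0 && bj > 0);
    std::vector<Block> blocks;
    for (std::size_t i0 = 0; i0 < ni; i0 += bi) {
        for (std::size_t j0 = 0; j0 < nj; j0 += bj) {
            blocks.push_back(Block{i0, std::min(i0 + bi, ni),
                                   j0, std::min(j0 + bj, nj)});
        }
    }
    return blocks;
}

// Hypothetical update function: 4-point average on interior points,
// plain copy on the domain boundary. Reads `in`, writes `out`.
void update_block(const Block& blk, std::size_t ni, std::size_t nj,
                  const std::vector<double>& in, std::vector<double>& out) {
    for (std::size_t i = blk.i0; i < blk.i1; ++i) {
        for (std::size_t j = blk.j0; j < blk.j1; ++j) {
            const std::size_t idx = i * nj + j;
            if (i == 0 || j == 0 || i + 1 == ni || j + 1 == nj) {
                out[idx] = in[idx];
            } else {
                out[idx] = 0.25 * (in[idx - nj] + in[idx + nj] +
                                   in[idx - 1] + in[idx + 1]);
            }
        }
    }
}

// One step: update all blocks from `cur` into `next`, then swap the buffers.
void time_step(std::size_t ni, std::size_t nj, std::size_t bi, std::size_t bj,
               std::vector<double>& cur, std::vector<double>& next) {
    assert(cur.size() == ni * nj && next.size() == ni * nj);
    for (const Block& blk : make_blocks(ni, nj, bi, bj)) {
        update_block(blk, ni, nj, cur, next);  // blocks are independent
    }
    std::swap(cur, next);
}
```

Since the result does not depend on the block size, the block dimensions can be tuned for cache capacity without affecting correctness.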
16
Stencil Library – Loop Merging
The library allows the definition of multiple stages per stencil
–Stages are update functions applied consecutively to one block
–As a block is typically much smaller than the complete domain we can leverage the caches of the CPU
17
Stencil Library – Calculation On The Fly
Calculation on the fly is supported using a combination of stages and column buffers
–Column buffers are fields with the size of one block local to every CPU core
–A first stage writes to a buffer while a second stage consumes the pre-calculated values
18
Stencil Code – My Toy Example
1. Naive
for k
  a(k) := b(k) + c(k)
end
...
for k
  d(k) := a(k-1)*e(-1) + a(k)*e(0) + a(k+1)*e(+1)
end
...
for k
  f(k) := a(k)*g(k) + d(k)
end
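A possible C++ rendering of the naive variant is sketched below. The slide does not specify loop bounds, so the sketch assumes that `a` is computed on the whole column and that `d` and `f` are only updated on the interior points, where both neighbours exist. The coefficients e(-1), e(0), e(+1) are stored as `e[0]`, `e[1]`, `e[2]`; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// Naive variant: three separate sweeps, with the intermediate field a
// written to memory and read back by the later sweeps.
// Assumption: d and f are only updated on interior points 1..n-2.
void toy_naive(const Column& b, const Column& c, const Column& g,
               const double e[3], Column& a, Column& d, Column& f) {
    const std::size_t n = b.size();
    assert(n >= 3 && c.size() == n && g.size() == n);
    assert(a.size() == n && d.size() == n && f.size() == n);

    for (std::size_t k = 0; k < n; ++k) {
        a[k] = b[k] + c[k];
    }
    for (std::size_t k = 1; k + 1 < n; ++k) {
        d[k] = a[k - 1] * e[0] + a[k] * e[1] + a[k + 1] * e[2];
    }
    for (std::size_t k = 1; k + 1 < n; ++k) {
        f[k] = a[k] * g[k] + d[k];
    }
}
```

Each sweep streams complete fields through memory, which is what the following variants try to avoid.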
19
Stencil Code – My Toy Example
2. No pre-calculation
for k
  d(k) := (b(k-1)+c(k-1))*e(-1) + (b(k)+c(k))*e(0) + (b(k+1)+c(k+1))*e(+1)
  f(k) := (b(k)+c(k))*g(k) + d(k)
end
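In C++ this variant might look like the sketch below, under the same assumptions as before (interior points only, e(-1), e(0), e(+1) stored as `e[0]`, `e[1]`, `e[2]`, hypothetical function name). The field `a` disappears entirely; the price is that every sum b+c is recomputed up to four times.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// No pre-calculation: a single merged sweep that recomputes b + c
// wherever it is needed instead of storing it in a field.
// Assumption: d and f are only updated on interior points 1..n-2.
void toy_on_the_fly(const Column& b, const Column& c, const Column& g,
                    const double e[3], Column& d, Column& f) {
    const std::size_t n = b.size();
    assert(n >= 3 && c.size() == n && g.size() == n);
    assert(d.size() == n && f.size() == n);

    for (std::size_t k = 1; k + 1 < n; ++k) {
        d[k] = (b[k - 1] + c[k - 1]) * e[0] +
               (b[k] + c[k]) * e[1] +
               (b[k + 1] + c[k + 1]) * e[2];
        f[k] = (b[k] + c[k]) * g[k] + d[k];
    }
}
```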
20
Stencil Code – My Toy Example
3. Pre-calculation with temporary variables
for k
  z := b(k+1) + c(k+1)
  d(k) := x*e(-1) + y*e(0) + z*e(+1)
  f(k) := y*g(k) + d(k)
  x := y
  y := z
end
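The slide leaves the initial values of x and y implicit. The sketch below assumes they are primed with a(0) and a(1) before the loop starts at k = 1, and keeps the earlier assumptions (interior points only, e(-1), e(0), e(+1) stored as `e[0]`, `e[1]`, `e[2]`, hypothetical function name).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// Pre-calculation with temporaries: x, y, z hold a(k-1), a(k), a(k+1)
// and are rotated every iteration, so each sum b + c is computed once.
// Assumption: x and y are primed with a(0) and a(1) before the loop.
void toy_rotating(const Column& b, const Column& c, const Column& g,
                  const double e[3], Column& d, Column& f) {
    const std::size_t n = b.size();
    assert(n >= 3 && c.size() == n && g.size() == n);
    assert(d.size() == n && f.size() == n);

    double x = b[0] + c[0];  // a(k-1) for the first interior point
    double y = b[1] + c[1];  // a(k)   for the first interior point
    for (std::size_t k = 1; k + 1 < n; ++k) {
        const double z = b[k + 1] + c[k + 1];  // a(k+1)
        d[k] = x * e[0] + y * e[1] + z * e[2];
        f[k] = y * g[k] + d[k];
        x = y;
        y = z;
    }
}
```

The rotation creates a dependency between consecutive iterations, which works for a sequential sweep in k but fixes the traversal order.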
21
Stencil Code – My Toy Example
4. Pre-calculation with column buffer
for k
  a(k) := b(k) + c(k)
end
for k
  d(k) := a(k-1)*e(-1) + a(k)*e(0) + a(k+1)*e(+1)
  f(k) := a(k)*g(k) + d(k)
end
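The difference to the naive variant is that `a` is no longer a full field but a small buffer that is reused. The sketch below assumes fields that store several vertical columns back to back, each of length `nk`, and a buffer of just one column (the slides describe buffers the size of one block per core; one column keeps the example short). The function name and the layout are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fields hold several vertical columns back to back, each of length nk.
// Assumption: d and f are only updated on the interior points 1..nk-2
// of every column; e[0], e[1], e[2] hold e(-1), e(0), e(+1).
void toy_column_buffer(const std::vector<double>& b,
                       const std::vector<double>& c,
                       const std::vector<double>& g,
                       const double e[3], std::size_t nk,
                       std::vector<double>& d, std::vector<double>& f) {
    assert(nk >= 3 && b.size() % nk == 0);
    assert(c.size() == b.size() && g.size() == b.size());
    assert(d.size() == b.size() && f.size() == b.size());

    // Column buffer: one column of a, reused for every column so that it
    // can stay in cache instead of being streamed through main memory.
    std::vector<double> a(nk);

    for (std::size_t off = 0; off < b.size(); off += nk) {
        for (std::size_t k = 0; k < nk; ++k) {
            a[k] = b[off + k] + c[off + k];
        }
        for (std::size_t k = 1; k + 1 < nk; ++k) {
            d[off + k] = a[k - 1] * e[0] + a[k] * e[1] + a[k + 1] * e[2];
            f[off + k] = a[k] * g[off + k] + d[off + k];
        }
    }
}
```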
22
Stencil Code – My Toy Example
5. Pre-calculation with stages & column buffer
Stencil
  Stage 1
    a := b + c
  Stage 2
    d := a*e (k:-1,0,1)
  Stage 3
    f := a*g + d
Apply Stencil
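How such a stage description could be embedded in C++ is sketched below. This is not the API of the stencil library presented here; `ColumnContext`, `StageA`, `StageD`, `StageF`, `run_stage` and `Stencil` are hypothetical names chosen for illustration. The point is the separation described on the earlier slide: each stage only defines an update function for one point, while the loop logic lives in one place and the list of stages is expanded at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-column context: inputs, column buffer a, outputs.
struct ColumnContext {
    const double* b;
    const double* c;
    const double* g;
    const double* e;  // e[0], e[1], e[2] hold e(-1), e(0), e(+1)
    double* a;        // column buffer
    double* d;
    double* f;
    std::size_t nk;
};

// Each stage provides only an update function and the halo it needs in k.
struct StageA {  // a := b + c
    static constexpr std::size_t halo = 0;
    static void update(ColumnContext& ctx, std::size_t k) {
        ctx.a[k] = ctx.b[k] + ctx.c[k];
    }
};
struct StageD {  // d := a*e (k:-1,0,1)
    static constexpr std::size_t halo = 1;
    static void update(ColumnContext& ctx, std::size_t k) {
        ctx.d[k] = ctx.a[k - 1] * ctx.e[0] + ctx.a[k] * ctx.e[1] +
                   ctx.a[k + 1] * ctx.e[2];
    }
};
struct StageF {  // f := a*g + d (assumption: interior points only, as d)
    static constexpr std::size_t halo = 1;
    static void update(ColumnContext& ctx, std::size_t k) {
        ctx.f[k] = ctx.a[k] * ctx.g[k] + ctx.d[k];
    }
};

// Loop logic, written once and shared by all stages.
template <typename Stage>
void run_stage(ColumnContext& ctx) {
    for (std::size_t k = Stage::halo; k + Stage::halo < ctx.nk; ++k) {
        Stage::update(ctx, k);
    }
}

// A stencil is a compile-time list of stages applied consecutively.
template <typename... Stages>
struct Stencil {
    static void apply(ColumnContext& ctx) {
        // Pack expansion inside a braced list runs the stages in order.
        int expand[] = {0, (run_stage<Stages>(ctx), 0)...};
        (void)expand;
    }
};

using ToyStencil = Stencil<StageA, StageD, StageF>;
```

Because the stage list is a template parameter pack, the compiler sees the complete sequence of update functions and can inline them into the loops.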
23
Status
24
Status
So far the following stencils have been implemented:
–Fast wave solver (w bottom boundary initialization missing)
–Advection
  5th order advection
  Bott 2 advection (cri implementation missing)
–Complete tendencies
–Horizontal Diffusion
–Coriolis
The next steps are:
–Implicit vertical diffusion
–Put it all together
–Performance optimization
25
Discussion
Acknowledgements to all our collaborators at
–C2SM (Center for Climate Systems Modeling)
–MeteoSwiss
–DWD (Deutscher Wetterdienst)
–CSCS