Separation of Concerns: A route to maintainability in large, parallel scientific codes
Second Conference of Research Software Engineers, 7-8 September 2017
Over the next 15 minutes I hope to provide a brief overview of how modern trends in HPC systems are affecting the Met Office’s current software. I will then go on to outline how we hope to continue to take advantage of changing technology. © Crown Copyright 2017, Met Office
The Unified Model
Consists of: Atmosphere, Ocean, Sea ice, Land surface
“Same” code for weather and climate
Much more science in a climate run
Project started in 1990
Global grid evolved from 150km to 10km
Four-day forecast as accurate as a one-day forecast 30 years ago

The primary model used by the Met Office is the “Unified Model”. “Unified” in this case refers to the fact that the same underlying software is used for both short-range weather forecasting and long-range probabilistic climate prediction. Different science packages are switched in and out depending on need but the core remains the same. It is also unified in the sense of being constructed of many sub-models coupled together. Again these different components may be switched in and out as required. It is not uncommon to run just the atmosphere component, which is the part we will be discussing today. This has served us well for almost 30 years and brought us a long way. Grid resolution has improved by more than a factor of ten and a torrent of new science has been added. All this contributes to four-day forecasts now being as accurate as one-day forecasts were when the project started. So why not continue to develop it? © Crown Copyright 2017, Met Office
Why we can’t go on as we are: Heat death of the MHz war
“If you were ploughing a field, which would you rather use: Two strong oxen or 1024 chickens?” attributed to Seymour Cray

You may be familiar with this quip attributed to Seymour Cray. It is undeniably glib and is absolutely correct in a perfect world. In fact let’s not complicate life with two oxen, let’s just have the one. Sadly we don’t live in a perfect world. Ultimately this was a rather desperate attempt to deny the reality of the situation, which is that the two strong oxen required to be equivalent to 1024 modern chickens would be larger than their barn and require their own body weight in feed every 5 minutes. Intel were aware of this problem back in the Pentium era. Here is a plot they produced at that time. It indicates that if power dissipation in microprocessors had continued to increase as it had been doing, our desktop workstations would have surpassed the heat dissipation requirements of a rocket nozzle some time ago. Obviously this didn’t happen. One of the reasons for this is that chip designers stopped the doomed pursuit of ever-higher clock speeds and parlayed Moore’s Law’s extra transistors into multiple cores instead. This meant they could continue to take advantage of ever-shrinking feature sizes without melting a hole through the desk. But it does mean that the free lunch of ever-increasing clocks is over. We are looking forward to a world not of 1024 chickens, but one of bees. Clearly we will need extreme scaling ability in the future and the UM tops out quite quickly. © Crown Copyright 2017, Met Office
Why we can’t go on as we are: Pole position
There are problems other than hardware as well. I invite you to consider exhibit A. The UM uses a regular latitude-longitude mesh. We like these as they are structured and all sorts of assumptions can be made which allow for efficient computation. Not least of these being that if I add 1 to X, I will be in the next horizontally neighbouring cell. Unfortunately at the pole the mesh points tend towards infinite density. The techniques used to deal with this mathematical calamity tend to scale poorly. As previously mentioned, poor scaling is increasingly an issue. To tackle the pole density problem we have chosen to adopt the cubed sphere. We could have gone with a triangular mesh as is common in computer graphics but our fluid dynamicists like quadrilateral elements. This mesh gives us that without infinitely dense poles. It either has no poles or six, depending on how you want to look at it. We could treat this as six structured meshes with a lot of special cases around the joins, but instead we choose to treat it as an unstructured mesh. The advantage of this is that all the mesh-handling code is written for a mesh without structure, meaning that should we change our mind it will support whatever alternative mesh we choose. The downside is that now no assumptions can be made about which elements are next to which other elements. Instead every move up, down, left or right must be performed with an indirect look-up, which kills any chance of taking advantage of the processor’s vector unit. Since an awful lot of the performance of modern processors comes from the vector unit this is a considerable loss. To try and give the vector unit something to work with and reclaim some of that performance we extrude our mesh in height, making the data contiguous in altitude. This should improve the work-to-indirection ratio. © Crown Copyright 2017, Met Office
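To make the indirection cost concrete, here is a toy NumPy sketch (the real code is Fortran; the sizes and the neighbour table below are invented for illustration). Horizontal neighbour access on the unstructured mesh becomes a gather through a look-up table, while the extruded vertical direction remains a plain contiguous operation.

```python
# Toy picture of the trade-off described above; sizes and the neighbour
# table are invented for illustration, this is not LFRic code.
import numpy as np

rng = np.random.default_rng(0)
ncolumns, nlevels = 10_000, 20
field = rng.random((ncolumns, nlevels))   # extruded mesh: each column is contiguous
east = rng.permutation(ncolumns)          # unstructured horizontal neighbour table

# Horizontal neighbour access is an indirect look-up (a gather through the
# 'east' table), which frustrates the vector unit.
d_dx = field[east, :] - field

# Within a column the data is contiguous, so vertical differences are a plain
# strided operation that vectorises well.
d_dz = field[:, 1:] - field[:, :-1]

print(d_dx.shape, d_dz.shape)             # (10000, 20) (10000, 19)
```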
Why we can’t go on as we are: Too much technology
Let’s take a brief moment to consider the current technological landscape. We have the stalwart standards MPI for distributed memory and OpenMP for shared memory. Then NVIDIA came along with CUDA for programming their GPUs. Obviously it suffered from being tied to one manufacturer’s hardware, so a generic equivalent in the form of OpenACC came along. Then there are exciting potential hardware platforms such as ARM and Intel’s Knights Landing. Trying to exploit all of these, along with whatever else may come along, to their best advantage is the sort of problem which software engineers relish but which drives scientists mad. Each time a new platform or new technology has to be supported the code is modified to take advantage. The problem is that this invariably obfuscates what the code is doing as it is re-written, mangled and garnished with directives. © Crown Copyright 2017, Met Office
Enter LFRic: All new technical infrastructure
Finite element dynamical core
Support existing science
Operational trials start 2020
Enter service by 2023

That’s Lewis Fry Richardson, the sternest man in meteorology and the first man to try numerical weather forecasting. It wasn’t particularly expedient, taking longer to calculate than to wait for the weather to arrive. However, he did prove that it was possible, and it is his name we’ve mangled for the project to replace the UM. The primary requirement is for the replacement to be scientifically no worse than the UM. The secondary requirement is that it be fit for the next 30 years. A large part of that is better scaling characteristics, which have driven a number of choices including the move to a finite element scheme for the dynamics instead of the current finite difference one. We are also taking the opportunity to sweep away technical debt by building this thing anew from the ground up. Of course a whole bag of new debt will be created but improvement is always by iteration. While we would like to see all the science rewritten as well, realistically that isn’t going to happen, at least not straight away. So we have to be able to interoperate with existing science, which is exclusively finite difference in nature. This is an ambitious project so the timeline will probably slip. © Crown Copyright 2017, Met Office
How can we fix this? Loose coupling and tight cohesion
Basics
Loose coupling and tight cohesion
Defined and enforced interfaces
Data scoping and hiding
Technical testing
Source transformation and generation

None of the techniques we are using to aid us in our quest for Nirvana are novel in themselves. In fact most are basic first-year computer science concepts. Units of code are tightly cohesive, which is to say they should do one thing and one thing only. They are also loosely coupled, so no passing values by side effect. Different functional units communicate with each other through well defined and enforced interfaces. Thus anything may be rewritten and, as long as it maintains the same interface, no one will be any the wiser. Data which is only needed internally to a component is only visible or accessible within that component. This prevents other components monkeying with internal state and causing it to change under our feet. It also prevents other components making assumptions about how something works and coming to rely on it working that way. This is an extension of the interfaces mentioned above. We have also introduced unit testing. Hitherto there has been functional testing, where synthetic starting conditions are used and the evolution of the model observed. Things like the “warm bubble” and the “cold bubble” tests. This is essential but not sufficient. Unit testing allows us to build confidence in correctness from the bottom up rather than the top down. This can make bug hunting a lot easier. The only technique we are pursuing which is not already widely used is source transformation and generation. Generation, where source is created from whole cloth, is used to interface different levels of human-written source. It is a repetitious process, uninteresting yet prone to errors. Have a machine do it. Transformation is used to adapt human-written source to the generated source. In fact generation is used wherever we end up with a lot of boilerplate. Loading namelists is an example of this: quite a lot of processing is involved to present a slick namelist interface, and it has to be repeated for each namelist. That amounts to a lot of code identical apart from variable names. An ideal candidate for simple template-based generation. This has nothing to do with C++ templates and everything to do with dropping variable names into a pre-written source file. © Crown Copyright 2017, Met Office
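As a flavour of what that template-based generation looks like, here is a minimal sketch in Python. The template text and names are invented for illustration; this is not LFRic’s real namelist machinery, just the shape of the idea.

```python
# Drop variable names into a pre-written Fortran source template.
# Template text and names are illustrative assumptions only.
from string import Template

NAMELIST_LOADER = Template("""\
subroutine read_${name}_namelist(unit)
  implicit none
  integer, intent(in) :: unit
  ${decls}
  namelist /${name}/ ${vars}
  read(unit, nml=${name})
end subroutine read_${name}_namelist
""")

def generate_loader(name, variables):
    """Generate a Fortran namelist-reading routine from a list of (name, type)."""
    decls = "\n  ".join(f"{ftype} :: {var}" for var, ftype in variables)
    return NAMELIST_LOADER.substitute(
        name=name,
        decls=decls,
        vars=", ".join(var for var, _ in variables),
    )

print(generate_loader("timestepping", [("dt", "real"), ("nsteps", "integer")]))
```

Running it prints a complete loader routine for the made-up "timestepping" namelist; repeating that for every namelist costs nothing once the template exists.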
PSyKAl: Parallel System, Kernel, Algorithm
Driver (schedule)
Algorithm spec (fields)
Algorithm (fields)
PSy layer
Kernel spec (columns)
Kernel (columns)
Infrastructure – Mesh, Partition, F.E., …

We have developed a tortured contraction, “PSyKAl” or Parallel System, Kernel, Algorithm. This splits the model into three primary layers and a driver to hold it all together. The driver describes which algorithms are to be called and in which order. It handles set-up, tear-down and the looping required to iterate through the needed timespan for the model run. The algorithm layer deals only in whole fields. No access is available to the field data. Fields are acted upon by calling one or more “kernels”. A kernel, in contrast, deals only with columns of data and knows nothing of fields. Remember this is done in order to have a loop over contiguous data at the bottom, so that we can take advantage of vector units. In between is the Parallel System (PSy) layer, which is responsible for unpacking the fields and looping over their contents, passing columns to the kernels. This on its own would be a powerful tool for managing complexity and maintenance issues, even though it is really just an implementation of the first-year techniques we covered earlier. However, let’s fly in the special sauce which we hope will bring even greater benefit. We use the PSyclone tool (also being presented at this conference) to perform source-to-source transformation and generation. It takes in the algorithm and kernel sources, generates source which maps between them and rewrites the algorithms to call this code. The generated PSy layer makes use of our infrastructure to provide support for such things as the mesh, partitioning and finite elements. It also uses libraries for I/O and comms. PSyclone is also responsible for enacting any optimisation scripts which may have been provided. These are additional transformations applied to the source. For example such a script might identify loops over field data and impose OpenMP directives on them, having first called into the infrastructure to colour any data which stores values on shared entities. © Crown Copyright 2017, Met Office
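As a rough picture of the layering, here is a toy Python analogue of the PSyKAl separation. In LFRic the layers are Fortran and the PSy layer is generated by PSyclone, so everything below, names included, is illustrative only.

```python
# Toy analogue of the PSyKAl layering; all names are invented for illustration.
import numpy as np

def kernel_increment_column(column, delta):
    """Kernel: sees one contiguous column of data, knows nothing of fields."""
    column += delta                      # contiguous in the vertical, so vectorisable

def psy_invoke_increment(field, delta):
    """PSy layer: unpacks the field and loops over columns, calling the kernel.
    In LFRic this layer is generated, and parallelism (halo exchanges, OpenMP
    over coloured columns) would be inserted here by optimisation scripts."""
    ncolumns, _nlevels = field.shape
    for c in range(ncolumns):
        kernel_increment_column(field[c, :], delta)

def algorithm_step(field):
    """Algorithm layer: deals only in whole fields, never in raw data layout."""
    psy_invoke_increment(field, delta=1.0)

# Driver: set-up and the time-stepping loop.
theta = np.zeros((6, 20))                # 6 columns x 20 levels, purely illustrative
for _timestep in range(3):
    algorithm_step(theta)
print(theta[0, :3])                      # -> [3. 3. 3.]
```

The point is the shape: the algorithm never touches raw data, the kernel never sees a whole field, and all the looping (and therefore all the parallelism) lives in the middle layer.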
Why Bother? Re-use algorithms and kernels in different models
Write what is meant, not what is performant
Reduce conflict between science and optimisation
When what is performant changes, what is written is not
Once identified, optimisations can be applied universally and transparently
Less likely to miss opportunities to implement an optimisation
Changes in infrastructure not visible to science
As long as the API and functionality are maintained, what happens underneath is immaterial

Clearly additional complexity is added to both the build and development process by adopting this approach, so why do we think it’s a good idea? We envisage thoughtfully scoped algorithms and kernels being re-used in a number of different models. Currently a number of things which might fruitfully be thought of as different are combined in the UM because they share some code. This feels like the tail wagging the dog: really the infrastructure should support re-use rather than having the developer do it. For instance we might like to separate out all the various scientific components, such as chemistry, into their own stand-alone models. This could potentially aid development as it can be performed on a limited and relevant subset of the whole codebase. Easier to see what’s going on and quicker compile-test cycles. From the scientist’s point of view it should allow them to write their code in the most straightforward way and have it stay that way. Currently the optimisation team will sometimes come in and rewrite sections of code for performance reasons, leaving the original authors surprised and potentially annoyed. Furthermore, when the target changes, as it does on a regular basis, and the optimisation team need to rework code for that new platform, they don’t have to undo their previous changes. Because the system is semi-automated it gives optimisation a speculative quality. Once profiling has identified a bottleneck and an optimisation is implemented as a suitable script, that optimisation is applied everywhere it is appropriate. For instance if loop blocking is found to be valuable it is applied to all loops with no further expenditure of effort. This should take a lot of the grind out of optimisation and mean that potentially valuable opportunities to optimise are less likely to be missed. Continuing the themes of smoothing workflow and optimisation, the infrastructure on which the model sits can be rewritten and optimised without interfering with development. Without notice even, except hopefully for greater robustness and speed. This approach should also help us to better enforce interfaces between code modules and other coding practices, as checks may be performed as part of the transformation process. Dangerous patterns can be rejected at this point rather than surfacing as a mysterious bug. © Crown Copyright 2017, Met Office
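To illustrate the “write the optimisation once, apply it everywhere” idea, here is a purely hypothetical sketch of such a script: one pass that walks a generated schedule and applies the same transformation to every matching loop. The Schedule and Loop classes are invented for illustration; they are not PSyclone’s real API.

```python
# Hypothetical optimisation script: colour every cell loop and mark it for
# OpenMP in the generated source. All classes here are invented placeholders.
from dataclasses import dataclass, field

@dataclass
class Loop:
    iterates_over: str                     # e.g. "cells" or "levels"
    coloured: bool = False
    directives: list = field(default_factory=list)

@dataclass
class Schedule:
    loops: list

def colour_and_parallelise(schedule):
    """Colour each cell loop (avoiding races on shared entities) and add an
    OpenMP directive, for every such loop in the schedule."""
    for loop in schedule.loops:
        if loop.iterates_over == "cells":
            loop.coloured = True
            loop.directives.append("!$omp parallel do")
    return schedule

sched = colour_and_parallelise(Schedule([Loop("cells"), Loop("levels"), Loop("cells")]))
print([(l.iterates_over, l.coloured, l.directives) for l in sched.loops])
```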
Does it blend? Scaling smoke, don’t breathe this!
Machine has 6720 Broadwell nodes
Incomplete dynamical core with solver
1/10 scaled Earth with 20 levels
32-bit limit in comms library
Run out of numbers and therefore work
6144 nodes is the best part of a quarter of a million cores
Almost the whole machine
Science written as serial code
Parallelism added programmatically

As part of the most recent procurement the Met Office took delivery of this monster. It is currently ranked 11 in the Top 500 and is built from Broadwell nodes. During the acceptance testing process we had the opportunity to run on the whole machine, so we did. Although our dynamical core is not yet complete it does have a working solver, which is a time-critical component. It was used on a 1/10 scale Earth with 20 levels in the atmosphere. As you can see we got pretty good scaling for smaller problem sizes but it went a bit wrong for the very large runs. This is due to the comms library we are using only supporting 32-bit integers for the number of cell entities. So what you are seeing there is the machine running out of work. The C1152 mesh, divided between the cores being used, leaves each core with only 720 cells to work with. At that point the cost of communication begins to swamp the cost of calculation. Still, that’s an approximately 800m resolution model (equivalent to 8km on a full-sized Earth) running on the best part of a quarter of a million cores. We are pretty happy with that. But the single most important thing is that the science code was written as though it were serial. All the parallelisation occurred programmatically behind the scenes. © Crown Copyright 2017, Met Office
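As a back-of-envelope check of that 720 figure: the per-node core count used below is an assumption (it is not stated on the slide); the other numbers come straight from the talk.

```python
# Rough reproduction of the cells-per-core figure quoted above.
# Assumption: 36 Broadwell cores per node (not stated on the slide).
panels, n, levels = 6, 1152, 20        # C1152 cubed sphere, 20 vertical levels
nodes, cores_per_node = 6144, 36       # cores per node is an assumption

columns = panels * n * n               # 7,962,624 columns
cells = columns * levels               # roughly 160 million cells in total
cores = nodes * cores_per_node         # 221,184 cores: "best part of a quarter of a million"

print(cells // cores)                  # -> 720 cells per core
```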
Only the Beginning: Rewriting kernels to insert OpenACC
Most of the work still to do
Optimisation
Rewriting kernels to insert OpenACC
Cache blocking loops
Replace intrinsic maths functions (e.g. matmul) with high-performance versions (e.g. Intel maths library)
Generating driver layer
Looping
Sub-looping
Diagnostics
Even more testing

In many ways we are still at the beginning of this project. Most of the work is still to do. There is an almost infinite list of optimisations we might add. Here are a few to give a taste of what we are considering. Only algorithms are transformed at the moment, and then only lightly. If that capability were extended to kernels we could potentially have OpenACC or OpenMP SIMD directives added to better guide the generation of vector code. In fact, with separate scripts for each, we could flip between them for fun and profit. Loops could be cache blocked. Remember that this would only happen to the generated source; the original would remain unchanged, looping in the clearest and most obvious fashion. Profiles could be developed for each machine so the code is always optimally blocked for the target platform. And that’s all appropriate loops, not just the ones people thought to look at. Standard Fortran intrinsics such as “matmul” could be substituted with highly optimised, vendor-supplied library calls. Again, without compromising the readability and portability of the original source. No one likes writing driver code and really, no one should have to. It should be possible to simply describe the looping structure and have all the prognostic and diagnostic fields determined and implemented automatically. I/O could then be hooked in and that’s one less thing anyone has to think about. As such the driver presents an attractive target for future work. All this power and flexibility only increases complexity and therefore the need for rigorous testing. We need to explore more and better ways to test, and to test automatically. © Crown Copyright 2017, Met Office
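As a taste of what cache blocking as a source-to-source step might look like, here is an invented emitter that turns a simple loop description into blocked Fortran text. It is illustrative only, not PSyclone’s transformation machinery; the human-written loop stays clear while the generated source carries the blocking, which could be tuned per platform.

```python
# Illustrative source-to-source cache blocking; the emitter and its output
# template are invented for this sketch.
def emit_blocked_loop(var, start, stop, block, body):
    """Emit Fortran text for a loop over [start, stop] blocked in chunks of 'block'."""
    return "\n".join([
        f"do {var}_outer = {start}, {stop}, {block}",
        f"  do {var} = {var}_outer, min({var}_outer + {block} - 1, {stop})",
        f"    {body}",
        "  end do",
        "end do",
    ])

print(emit_blocked_loop("cell", 1, "ncells", 256, "call kernel(field(:, cell))"))
```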
Project Contributors
GungHo Group (UK collaboration)
LFRic Group (Met Office)
PSyclone Group
© Crown Copyright 2017, Met Office
Any questions? © Crown Copyright 2017, Met Office