
1 Early Experience with Applications on POWER8 Mike Ashworth, Rob Allan, Rupert Ford, Xiaohu Guo, Mark Mawson, Jianping Meng, Andrew Porter STFC Hartree Centre & STFC Daresbury Laboratory Manish Modani IBM India mike.ashworth@stfc.ac.uk

2 Early Experience with POWER: HPCx
– First Terascale system in the UK
– UK National Service for 7 years at Daresbury Lab
– Upgraded twice: 3, 6, 12 Tflop/s Rmax
– Final system: POWER5 with HPS switch, 160 nodes, 2560 cores
– www.hpcx.ac.uk
Daresbury Laboratory had a POWER3 system prior to HPCx: http://tardis.dl.ac.uk/computing_history/computing_history.pdf

3 OpenPOWER Consortium

4 Hartree Centre PADC
“The PADC will help industry and academia take advantage of IBM and NVIDIA’s technological leadership in supercomputing and the Hartree Centre's expertise and experience in delivering solutions to real-world problems.”
Dr Peter Allan, Director of the Hartree Centre, 22nd October 2015

5 POWER8 System
– Two nodes: six cores per socket, four sockets per node
– 24 cores, 2061-3325 MHz
– 8-way SMT gives 192 virtual cores per node
– K40 GPU
– Software: Ubuntu 3.16.0-23; IBM XL Fortran for Linux, V15.1.2; IBM XL C/C++ for Linux, V13.1.2; OpenMPI 1.8.7; MPICH-3.1.4; CUDA-7.0; LLVM-3.8.0 with OpenMP and Libomptargets
– No interconnect, so only single-node benchmarks
– System to be upgraded December 2015

6 Applications
– Lattice Boltzmann code using OPS (JM)
– Ocean Modelling using GungHo (RF, ARP)
– DL_MESO Lattice Boltzmann (MMo)
– Incompressible Smooth Particle Hydrodynamics (XG)
– Direct Numerical Simulation of Turbulence (MA)
– Jacobi Test Code (MMa)

7 Lattice Boltzmann simulation
See http://www.oerc.ox.ac.uk/projects/ops
– Developed by Jianping Meng and David Emerson @ DL
– Uses the OPS framework: OPS is a high-level framework with associated libraries and pre-processors to generate parallel executables for applications on multi-block structured grids
– Uses a Python translator to convert serial code into parallel code optimised for e.g. MPI, OpenMP, CUDA, OpenCL, … (the pattern is sketched below)
– To be presented at ParCFD 2016
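
To make the approach concrete, here is a hand-written, hypothetical illustration of the pattern such a framework automates: the user supplies a point-wise kernel, and the framework owns the loop nest (which it can emit as MPI, OpenMP, CUDA or OpenCL). This is not the real OPS interface (OPS exposes calls such as ops_par_loop); the names below are invented for illustration.

```c
/* Illustrative only: the kernel/loop separation that an OPS-style
 * translator generates automatically; not the real OPS API. */
#include <stdio.h>

#define NX 8
#define NY 8

/* "Kernel": the user writes only the point-wise operation. */
static void scale_kernel(double *out, const double *in)
{
    *out = 2.0 * (*in);
}

int main(void)
{
    static double a[NY][NX], b[NY][NX];

    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            a[j][i] = (double)(j * NX + i);

    /* "Parallel loop": in OPS the framework owns this loop nest and can
     * generate MPI, OpenMP, CUDA or OpenCL versions of it. */
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            scale_kernel(&b[j][i], &a[j][i]);

    printf("b[%d][%d] = %g\n", NY - 1, NX - 1, b[NY - 1][NX - 1]);
    return 0;
}
```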

8 Lattice Boltzmann POWER8
– Problem: 2D Taylor-Green vortex
– Grid size: 4096 x 4096
– Total time steps: 500
– Code: 2D lattice Boltzmann model with nine discrete velocities (D2Q9), based on the Oxford Parallel library for Structured meshes (OPS), written in C++ (the standard D2Q9 equilibrium is recalled below)
– Features: able to utilise various parallel techniques including MPI, OpenMP, CUDA, OpenCL, etc.
– MPI, built with GCC
– All calculations are double precision
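
For readers unfamiliar with D2Q9, the standard BGK equilibrium distribution that such a model relaxes towards is, in its textbook form (the slides do not show the code's exact formulation):

```latex
f_i^{\mathrm{eq}} = w_i \,\rho \left[ 1 + 3\,(\mathbf{e}_i \cdot \mathbf{u})
    + \tfrac{9}{2}\,(\mathbf{e}_i \cdot \mathbf{u})^2
    - \tfrac{3}{2}\,\mathbf{u} \cdot \mathbf{u} \right],
\qquad i = 0,\dots,8,
```

with weights w0 = 4/9, w1..4 = 1/9 and w5..8 = 1/36 in lattice units (c = 1).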

9 Lattice Boltzmann POWER8 vs. Ivy Bridge
– Early POWER8 results
– POWER8 beyond 24 cores uses SMT
– SMT shows good speed-up: memory-bound code, overlapping memory fetches with computation
– Intel E5 Ivy Bridge does not use hyperthreading
– Intel E5 Ivy Bridge scales well

10 GungHo – What and Why?
GungHo is a UK Met Office, NERC and STFC project aiming to research, design and develop a new dynamical core suitable for operational, global and regional, weather and climate simulation.
Computer architectures are in a state of flux with a variety of competing technologies:
– GungHo is developing code for a computer that does not yet exist
– Many cores, accelerators (PCIe or socket), FPGAs…
How can we produce maintainable, scientifically-correct code that will perform well on a range of future architectures?

11 NERC-funded Technology Proof of Concept fund, 03/2014 – 02/2015
– Collaboration between the National Oceanography Centre, Liverpool (NOC) and STFC
– Investigate the feasibility of applying technology from the GungHo project to ocean modelling
– Extend the developing GungHo infrastructure to support finite difference on regular, lat-lon grids

12 Separation of Concerns in GungHo

13 The Parallel System, Kernel, Algorithm (PSyKAl) Approach…
The oceanographer writes the algorithm (top) and kernel (bottom) layers, following certain rules:
– no need to worry about relative indexing of various fields
– no need to worry about parallelism (the algorithm layer deals with logically global field quantities)
A code-generation system (PSyclone) generates the PSy middle layer (a conceptual sketch follows below):
– glues the algorithm and kernels together
– incorporates all code related to parallelism
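
A minimal, hand-written sketch of the three-layer idea is given below. It is purely conceptual: PSyclone actually works with Fortran and generates the PSy layer automatically, and all names here are invented for illustration.

```c
/* Conceptual PSyKAl layering; not PSyclone's actual (Fortran) API. */
#include <stdio.h>

#define N 16

/* Kernel layer: operates on a single point; knows nothing about loop
 * bounds, halos or parallelisation. */
static void increment_kernel(double *field, int i, int j)
{
    field[j * N + i] += 1.0;
}

/* PSy layer: in the real system this glue code is generated by PSyclone
 * and is where loops, halo exchanges and OpenMP/MPI directives live. */
static void invoke_increment(double *field)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            increment_kernel(field, i, j);
}

/* Algorithm layer: the scientist works with whole fields and simply
 * invokes kernels; no indexing or parallelism appears here. */
int main(void)
{
    static double field[N * N];
    invoke_increment(field);
    printf("field[0] = %g\n", field[0]);
    return 0;
}
```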

14 Shallow optimisation stages (serial) (256 x 256 case)

15 DL_MESO Lattice Boltzmann
– DL_MESO mesoscale code has DPD and LBE variants
– Scales well to 24 cores
– SMT gives additional performance to 96 virtual cores (SMT=4)

16 Jacobi Test Code
– Developed by several authors in the CCP-ASEArch project
– Benchmark for a 3D Jacobi solver
– Available in MPI, OpenMP, CUDA, OpenCL and OpenACC
– Iterates a 7-point stencil on a cuboid grid (a minimal sketch follows below)
– The initial grid state is an eigenvector of the Jacobi smoother: sin(pi*kx*x/L)*sin(pi*ky*y/L)*sin(pi*kz*z/L)
– Available from the CCPForge repository: https://ccpforge.cse.rl.ac.uk/gf/project/asearchtest/
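
The benchmark's update is straightforward to sketch. The stand-alone C fragment below is not the JTC source (which also provides MPI, OpenMP, CUDA, OpenCL and OpenACC variants); it just shows a plain 7-point Jacobi sweep with the sin*sin*sin initial state described above, assuming kx = ky = kz = 1 for simplicity.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NX 66   /* grid extents, including boundary points */
#define NY 66
#define NZ 66
#define IDX(i, j, k) ((size_t)(k) * NY * NX + (size_t)(j) * NX + (size_t)(i))

int main(void)
{
    double *u    = calloc((size_t)NX * NY * NZ, sizeof *u);
    double *unew = calloc((size_t)NX * NY * NZ, sizeof *unew);
    if (!u || !unew) return 1;

    /* Initial state: sin*sin*sin eigenvector of the Jacobi smoother
     * (kx = ky = kz = 1); it vanishes on the boundary. */
    const double pi = acos(-1.0);
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                u[IDX(i, j, k)] = sin(pi * i / (NX - 1)) *
                                  sin(pi * j / (NY - 1)) *
                                  sin(pi * k / (NZ - 1));

    for (int it = 0; it < 100; it++) {
        /* One 7-point Jacobi sweep over the interior points */
        for (int k = 1; k < NZ - 1; k++)
            for (int j = 1; j < NY - 1; j++)
                for (int i = 1; i < NX - 1; i++)
                    unew[IDX(i, j, k)] =
                        (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                         u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                         u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
        double *tmp = u; u = unew; unew = tmp;   /* ping-pong buffers */
    }

    printf("centre value after 100 sweeps: %g\n",
           u[IDX(NX / 2, NY / 2, NZ / 2)]);
    free(u); free(unew);
    return 0;
}
```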

17 Jacobi Test Code on POWER8

18 Incompressible SPH
– Real engineering problems are 3-D, very large and/or act on multiple scales
– Very large particle numbers and high resolutions are required for high accuracy: 100+ million particles
– The ISPH code should be able to scale on petaflop HPC platforms, with consideration of software development for exascale
– The incompressible SPH method with projection-based pressure correction and shifting (outlined below) has been shown to be highly accurate and stable for many free-surface flows
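
For context, projection-based pressure correction in its usual continuum form solves a pressure Poisson equation from an intermediate velocity u* and then projects the velocity field; the actual ISPH particle discretisation, including the shifting correction, is more involved than this sketch:

```latex
\nabla \cdot \left( \frac{1}{\rho}\,\nabla p^{\,n+1} \right)
    = \frac{\nabla \cdot \mathbf{u}^{*}}{\Delta t},
\qquad
\mathbf{u}^{\,n+1} = \mathbf{u}^{*} - \frac{\Delta t}{\rho}\,\nabla p^{\,n+1}.
```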

19 ISPH Domain Decomposition
What does this look like when we run it for a wet-bed dam break? An example of a violent nonlinear flow requiring highly adaptive domain decomposition across empty space.
[Figures: 256 partitions at t = 1 and at t = 5720∆t; colours denote partitions]

20 Industrial Applications
– Flow-structure impact modelling
– Large rigid flotsam (e.g. shipping containers)
– Wave energy devices: Manchester Bobber
– Laser cutting applications
– Environmental flow simulation

21 ISPH Performance
– ISPH, 5M particles, strong scaling
– ARCHER Cray XC30: each node 2x 2.7 GHz, 12-core E5-2697 v2 (Ivy Bridge)
– Scales well to 24 cores; SMT helps at 96 virtual cores (SMT=4)

22 SBLI: Direct Numerical Simulation (DNS) of Turbulence
– Written by Prof Neil Sandham et al., University of Southampton
– Shock-Boundary Layer Interaction
– Channel flow benchmark
– Scalable to billions of points, millions of CPUs
– Standard Fortran + MPI, no external libraries
– Finite difference with grid partitioning and halo exchange (sketched below)
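
SBLI itself is Fortran with a multi-dimensional decomposition; the fragment below is only a minimal C + MPI illustration of the halo-exchange idea for a 1-D decomposition, not SBLI code.

```c
/* Minimal halo exchange for a 1-D domain decomposition. */
#include <mpi.h>
#include <stdio.h>

#define NLOC 8   /* interior points owned by each rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NLOC + 1] are the halo cells. */
    double u[NLOC + 2];
    for (int i = 1; i <= NLOC; i++) u[i] = (double)rank;
    u[0] = u[NLOC + 1] = -1.0;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send the left-most interior point left and receive the right halo,
     * then the mirror image; a stencil can then read the halo cells. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: halos %.0f %.0f\n", rank, u[0], u[NLOC + 1]);

    MPI_Finalize();
    return 0;
}
```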

23 SBLI performance
SBLI scales well to 12 cores; beyond that it is probably memory-bound

24 Summary
– STFC Hartree Centre is an established PADC
– We have started porting codes to the POWER8+GPU architecture
– Scaling is good to 24 cores; SMT helps with some codes
– Compiler options used: -O3 -qhot -qtune=pwr8 -qarch=pwr8
– Still looking at:
  – Comparison with other CPUs
  – Checking memory performance using STREAM
  – Vectorisation and FMA generation
  – GPU codes, esp. OpenACC etc.
  – LLVM compiler

25 If you have been … … thank you for listening Mike Ashworth mike.ashworth@stfc.ac.uk http://www.stfc.ac.uk/scd

