Priority Project Performance On Massively Parallel Architectures (POMPA) Nice to meet you! COSMO GM10, Moscow

Overview
- Motivation
- COSMO code (as seen by a computer engineer)
- Important bottlenecks: memory bandwidth, scaling, I/O
- POMPA overview

Motivation
What can you do with more computational power?
- # EPS members (x 2)
- Resolution (x 1.25)
- Lead time (x 2)
- Model complexity (x 2)

Motivation How to increase computational power? Efficiency Algorithm Computer POMPA

Motivation
Moore's law has held since the 1970s and will probably continue to hold. Up to now we didn't need to worry too much about adapting our codes, so why should we worry now?

Current HPC Platforms
- Research system: Cray XT5 "Rosa", 3688 AMD hexa-core 2.4 GHz (212 TF), 28.8 TB DDR2 RAM, 9.6 GB/s interconnect bandwidth
- Operational system: Cray XT4 "Buin", 264 AMD quad-core 2.6 GHz (4.6 TF), 2.1 TB DDR RAM, 7.6 GB/s interconnect bandwidth
- Old system: Cray XT3 "Palu", 416 AMD dual-core 2.6 GHz (5.7 TF), 0.83 TB DDR RAM, 7.6 GB/s interconnect bandwidth
Source: CSCS

The Thermal Wall
Power ~ Voltage² × Frequency ~ Frequency³
Clock frequency will not follow Moore's Law! Source: Intel
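A back-of-the-envelope reading of this relation (a sketch only, assuming voltage scales roughly with frequency, which is what the cubic law above implies, and perfect parallel efficiency):

  P \propto V^2 f \propto f^3
  \quad\Rightarrow\quad
  P\!\left(\tfrac{f}{2}\right) \approx \tfrac{1}{8}\, P(f)

So within a fixed power budget one can afford roughly eight cores at half the clock, i.e. about four times the aggregate cycles per second of a single full-speed core. This is the trade-off behind the reinterpretation of Moore's Law on the next slide.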

Moore’s Law Reinterpreted Number of cores doubles every year while clock speed decreases (not increases) Source: Wikipedia

What are transistors used for?
[Annotated die photo of a single-core AMD Opteron: memory (latency avoidance), load/store/control (latency tolerance), memory and I/O interface. Source: Advanced Micro Devices Inc.]

The Memory Gap Memory speed only doubles every 6 years! Source: Hennessy and Patterson, 2006

"Brutal Facts of HPC"
- Massive concurrency: increase in the number of cores, stagnant or decreasing clock frequency
- Less and "slower" memory per thread: memory bandwidth per instruction/second and thread will decrease, more complex memory hierarchies
- Only slow improvements of inter-processor and inter-thread communication: interconnect bandwidth will improve only slowly
- Stagnant I/O sub-systems: technology for long-term data storage will stagnate compared to compute performance
- Resilience and fault tolerance: the mean time to failure of a massively parallel system may be short compared to the time to solution of a simulation, so fault-tolerant software layers are needed
We will have to adapt our codes to exploit the power of future HPC architectures! Source: HP2C

Why a new Priority Project?
- Efficient codes may enable new science and save money for operations.
- We need to adapt our codes to run efficiently on current / future massively parallel architectures!
- Great opportunity to profit from the momentum and know-how generated by the HP2C or G8 projects and to use synergies (e.g. ICON).
- Consistent with the goals of the COSMO Science Plan and similar activities in other consortia.

COSMO Code How would a computer engineer look at the COSMO code?

COSMO Code
227'389 lines of Fortran 90 code
[Chart: % of code lines vs. % of runtime (C-2 forecast), active code]

Dynamics

Key Algorithmic Motifs

Stencil computations:

  do k = 1, ke
    do j = 1, je
      do i = 1, ie
        a(i,j,k) = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
      end do
    end do
  end do

Tridiagonal solver (vertical, Thomas algorithm):

  do j = 1, je
    ! Modify coefficients (forward sweep)
    do k = 2, ke
      do i = 1, ie
        c(i,j,k) = 1.0 / ( b(i,j,k) - c(i,j,k-1) * a(i,j,k) )
        d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k)
      end do
    end do
    ! Back substitution
    do k = ke-1, 1, -1
      do i = 1, ie
        x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1)
      end do
    end do
  end do

Code / Data Structures
- field(ie,je,ke,nt) [in Fortran the first index is the fastest varying] (see the layout sketch below)
- Optimized for minimal computation (pre-calculations)
- Optimized for vector machines
- Often sweeps repeatedly over the complete grid (bad cache usage)
- A lot of copy-paste for handling different configurations (difficult to maintain)
- Metric terms and different averaging positions make the code complex
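A minimal, self-contained sketch of this layout (not actual COSMO code; the sizes, the field name and the time-level indices nnow/nnew are illustrative assumptions):

  program layout_sketch
    ! Sketch of the field(ie,je,ke,nt) layout: in Fortran the first index
    ! (i) varies fastest in memory, so i-neighbours are contiguous while
    ! k and the time level are the slowest-moving dimensions.
    implicit none
    integer, parameter :: ie = 520, je = 350, ke = 60, nt = 3
    integer, parameter :: nnow = 1, nnew = 2   ! illustrative time-level indices
    real, allocatable  :: field(:,:,:,:)
    integer :: i, j, k

    allocate(field(ie,je,ke,nt))
    field(:,:,:,nnow) = 0.0

    ! A typical operator performs a full (i,j,k) sweep like this one; when
    ! many operators sweep the grid one after the other, the data has left
    ! the cache again before it is reused ("bad cache usage" above).
    do k = 1, ke
      do j = 1, je
        do i = 1, ie
          field(i,j,k,nnew) = field(i,j,k,nnow)
        end do
      end do
    end do
  end program layout_sketch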

Parallelization Strategy
How to distribute the work onto O(1000) cores? 2D domain decomposition using MPI library calls.
Example: operational COSMO-2. Total: 520 x 350 x 60 gridpoints; per core: 24 x 16 x 60 gridpoints. Exchange information with MPI; halo/comp = 0.75. (A halo-exchange sketch follows below.)
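To illustrate the idea, here is a hedged, minimal halo-exchange sketch (not the actual COSMO exchange routine): each rank holds a 24 x 16 x 60 subdomain plus a one-point halo in i and swaps its boundary planes with its west/east neighbours. A real setup would use a 2D Cartesian communicator and exchange in j as well.

  program halo_exchange_sketch
    use mpi
    implicit none
    ! Illustrative subdomain size (taken from the COSMO-2 example above),
    ! with a halo of width 1 in the i-direction only.
    integer, parameter :: ni = 24, nj = 16, nk = 60
    real(kind=8) :: u(0:ni+1, nj, nk)
    real(kind=8) :: sbuf_w(nj,nk), sbuf_e(nj,nk), rbuf_w(nj,nk), rbuf_e(nj,nk)
    integer :: rank, nprocs, west, east, ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! 1D neighbour bookkeeping for the sketch; domain boundaries talk to nobody.
    west = rank - 1; if (west < 0)       west = MPI_PROC_NULL
    east = rank + 1; if (east >= nprocs) east = MPI_PROC_NULL

    u = real(rank, kind=8)

    ! Copy the boundary planes into contiguous buffers, swap them with the
    ! neighbours, then unpack the received planes into the halo cells.
    sbuf_e = u(ni, :, :)
    sbuf_w = u(1,  :, :)
    call MPI_Sendrecv(sbuf_e, nj*nk, MPI_DOUBLE_PRECISION, east, 0, &
                      rbuf_w, nj*nk, MPI_DOUBLE_PRECISION, west, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    call MPI_Sendrecv(sbuf_w, nj*nk, MPI_DOUBLE_PRECISION, west, 1, &
                      rbuf_e, nj*nk, MPI_DOUBLE_PRECISION, east, 1, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    if (west /= MPI_PROC_NULL) u(0,    :, :) = rbuf_w
    if (east /= MPI_PROC_NULL) u(ni+1, :, :) = rbuf_e

    call MPI_Finalize(ierr)
  end program halo_exchange_sketch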

Bottlenecks?
What are/will be the main bottlenecks of the COSMO code on current/future massively parallel architectures?
- Memory bandwidth
- Scalability
- I/O

Memory scaling
Problem size 102 x 102 x 60 gridpoints (60 cores, similar to COSMO-2). Keep the number of cores constant, vary the number of cores/node used.
[Chart: relative runtime, 4 cores = 100%]

HP2C: Feasibility Study
Goal: investigate how COSMO would have to be implemented in order to reach optimal performance on modern processors.
Tasks:
- understand the code
- performance model
- prototype software
- new software design proposal
Company:
Duration: 4 months (3 months of work)

Feasibility Study: Idea
Focus only on the dynamical core (fast wave solver) as it...
- dominates profiles (30% of the time)
- contains the key algorithmic motifs (stencils, tridiagonal solver)
- is of manageable size (14'000 lines)
- can be run stand-alone in a meaningful way
- correctness of the prototype can be verified

Feasibility Study: Results Prototype vs. Original

Key Ingredients
- Reduce the number of memory accesses (less precalculation)
- Change index order from (i,j,k) to (2,k,i/2,j) or (2,k,j/2,i): cache efficiency in the tridiagonal solver, don't load the halo into cache
- Use iterators instead of on-the-fly array position computations
- Merge loops in order to reduce the number of sweeps over the full domain (see the sketch below)
- Vectorize as much of the code as possible
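As an illustration of the loop-merging ingredient, a hedged sketch (array names, sizes and stencil weights are made up; the (2,k,i/2,j) reordering itself is only hinted at in the comments):

  program merge_loops_sketch
    ! Sketch of "merge loops": the two separate full-grid sweeps below force
    ! b to travel to memory and back between them; the merged sweep consumes
    ! b(i,j,k) while it is still in cache. A further step (not shown) would
    ! reorder the indices, e.g. towards (2,k,i/2,j), so that the k-recurrence
    ! of the tridiagonal solver also runs over cache-resident data.
    implicit none
    integer, parameter :: ie = 102, je = 102, ke = 60
    real :: a(ie,je,ke), b(ie,je,ke), c(ie,je,ke)
    integer :: i, j, k

    call random_number(a)

    ! Two separate sweeps (schematic of the original code structure):
    do k = 1, ke
      do j = 1, je
        do i = 2, ie-1
          b(i,j,k) = 0.25*a(i-1,j,k) + 0.5*a(i,j,k) + 0.25*a(i+1,j,k)
        end do
      end do
    end do
    do k = 1, ke
      do j = 1, je
        do i = 2, ie-1
          c(i,j,k) = 2.0*b(i,j,k)
        end do
      end do
    end do

    ! Merged into a single sweep over the full domain:
    do k = 1, ke
      do j = 1, je
        do i = 2, ie-1
          b(i,j,k) = 0.25*a(i-1,j,k) + 0.5*a(i,j,k) + 0.25*a(i+1,j,k)
          c(i,j,k) = 2.0*b(i,j,k)
        end do
      end do
    end do
  end program merge_loops_sketch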

GPUs have O(10) higher bandwidth! Source: Prof. Aoki, Tokyo Tech

Bottlenecks?
What are the main bottlenecks of the COSMO code on current/future massively parallel architectures?
- Memory bandwidth
- Scalability
- I/O

“Weak” scaling Problem size 1142 x 765 x 90 gridpoints (dt = 8s) “COSMO-2” Matt Cordery, CSCS

Strong scaling (small problem) Problem size 102 x 102 x 60 gridpoints (dt = 20s) “COSMO-2”

Improve Scalability?
Several approaches can be followed...
- Improve MPI parallelization
- Hybrid parallelization (loop level)
- Hybrid parallelization (restructure code)
- ...

Hybrid Motivation
NUMA = Non-Uniform Memory Access
[Figure: the node's view of memory vs. reality]

Hybrid Pros / Cons
Pros:
- Eliminates domain decomposition within the node
- Automatic memory coherency within the node
- Lower (memory) latency and faster data movement within the node
- Can synchronize on memory instead of a barrier
- Easier on-node load balancing
Cons:
- Benefit for memory-bound codes questionable
- Can be hard to maintain

Hybrid: First Results
OpenMP on loop level (> 600 directives). Matt Cordery, CSCS
[Scaling chart with a linear-speedup reference line]
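For illustration, a hedged sketch of what a single loop-level OpenMP directive on a COSMO-style k-j-i loop nest could look like (these are not the actual COSMO directives; names and sizes are made up):

  program omp_loop_sketch
    ! Loop-level hybrid parallelization: the MPI domain decomposition stays
    ! as it is, and OpenMP threads split the iterations of one loop nest
    ! within the node.
    use omp_lib
    implicit none
    integer, parameter :: ie = 102, je = 102, ke = 60
    real :: a(ie,je,ke), b(ie,je,ke)
    integer :: i, j, k

    call random_number(b)

    !$omp parallel do private(i,j,k) collapse(2) schedule(static)
    do k = 1, ke
      do j = 1, je
        do i = 2, ie-1
          a(i,j,k) = 0.25*b(i-1,j,k) + 0.5*b(i,j,k) + 0.25*b(i+1,j,k)
        end do
      end do
    end do
    !$omp end parallel do

    print *, 'threads available:', omp_get_max_threads()
  end program omp_loop_sketch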

Bottlenecks?
What are the main bottlenecks of the COSMO code on current/future massively parallel architectures?
- Memory bandwidth
- Scalability
- I/O

The I/O Bottleneck
- NetCDF I/O is serial and synchronous
- grib1 output is asynchronous (and probably not in an ideal way)
- No parallel output exists! (one possible way out is sketched below)
Example: operational COSMO-2 run.
[Table: runtime with and without output; columns REF (s), NO OUTPUT (s), DIFF (s); rows TOTAL (-11%), MPI, USER, MPI_gather, cal_conv_ind, organize_output, tautsp2d]
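One way to move output off the critical path is sketched below: a dedicated I/O rank receives the data and writes it while the compute ranks carry on. This is a hedged, minimal illustration of the idea only, not the COSMO output scheme; the file name, tag and sizes are made up, and a real implementation would use non-blocking sends and proper metadata.

  program async_io_sketch
    use mpi
    implicit none
    ! Per-rank field size, illustrative (24 x 16 x 60 as in the COSMO-2 example).
    integer, parameter :: nlocal = 24*16*60
    real(kind=8) :: field(nlocal)
    integer :: rank, nprocs, io_rank, src, ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    io_rank = nprocs - 1

    if (rank == io_rank) then
      ! Dedicated I/O rank: collect one message per compute rank and write it.
      open(42, file='output.dat', form='unformatted', access='stream')
      do src = 0, nprocs - 2
        call MPI_Recv(field, nlocal, MPI_DOUBLE_PRECISION, src, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        write(42) field
      end do
      close(42)
    else
      ! Compute rank: hand the data to the I/O rank and return to time stepping
      ! (a real implementation would use non-blocking sends here).
      field = real(rank, kind=8)
      call MPI_Send(field, nlocal, MPI_DOUBLE_PRECISION, io_rank, 0, &
                    MPI_COMM_WORLD, ierr)
    end if

    call MPI_Finalize(ierr)
  end program async_io_sketch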

PP-POMPA: Performance On Massively Parallel Architectures
Goal: prepare the COSMO code for emerging massively parallel architectures.
Timeframe: 3 years (Sep. 2010 – Sep. 2013)
Status: a draft of the project plan has been sent around; the STC has approved the project.
Next step: kickoff meeting and detailed planning of activities with all participants.

Tasks
① Performance analysis
② Redesign memory layout
③ Improving scalability (MPI, hybrid)
④ Massively parallel I/O
⑤ Adapt physical parametrizations
⑥ Redesign dynamical core
⑦ Explore GPU acceleration
⑧ Update documentation
The tasks cover both the current COSMO code base and new code and programming models. See project plan!

Who is POMPA?
- DWD (Ulrich Schättler, …)
- ARPA-SIMC, USAM & CASPUR (Davide Cesari, Stefano Zampini, David Palella, Piero Lanucara, Alessandro Cheloni, Pier Francesco Coppola, …)
- MeteoSwiss, CSCS & SCS (Oliver Fuhrer, Will Sawyer, Thomas Schulthess, Matt Cordery, Xavier Lapillonne, Neil Stringfellow, Tobias Gysi, …)
- And you?

Questions? Coming to a supercomputer near you soon!