
Belgrade, 26 September 2014 George S. Markomanolis, Oriol Jorba, Kim Serradell Overview of on-going work on NMMB HPC performance at BSC

2 Outline
Introduction to the OmpSs programming model
Experiences with OmpSs
Future work

3 NMMB/BSC-CTM
More than lines of Fortran code for the main core
NMMB/BSC-CTM is used operationally by the dust forecast center in Barcelona
NMMB is the operational model of NCEP
The overall goal is to improve its scalability and the simulation resolution

OmpSs Introduction
Parallel programming model:
- Built on an existing standard: OpenMP
- Directive based, to keep a serial version
- Targets SMP, clusters, and accelerator devices
- Developed at the Barcelona Supercomputing Center (BSC)
Mercurium source-to-source compiler
Nanos++ runtime system

OmpSs Example
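The example itself is not reproduced in the transcript. As an illustration only, a minimal OmpSs task annotation in Fortran might look like the sketch below; the subroutine and array names are hypothetical, not taken from NMMB. OmpSs extends OpenMP-style task directives with in/out/inout clauses from which the Nanos++ runtime derives the task dependencies.

  ! Minimal OmpSs sketch (hypothetical names, not NMMB code).
  subroutine ompss_example(a, b, c, n)
    implicit none
    integer, intent(in)    :: n
    real,    intent(in)    :: a(n)
    real,    intent(inout) :: b(n), c(n)

    ! Each task declares the data it reads/writes; the Nanos++ runtime
    ! builds the dependency graph and runs independent tasks concurrently.
    !$omp task in(a) out(b)
    b = 2.0*a
    !$omp end task

    ! This task reads b, so it is scheduled after the first one completes.
    !$omp task in(b) inout(c)
    c = c + b
    !$omp end task

    ! Wait for all tasks created above before returning.
    !$omp taskwait
  end subroutine ompss_example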

6 Roadmap to OmpSs
NMMB is based on the Earth System Modelling Framework (ESMF)
The current ESMF release (v3.1) does not support threads
However, the development version of NMMB uses ESMF v6.3
The post-processing broke because of some other issues (which will be fixed)
The new version of NMMB with OmpSs support has been compiled on MareNostrum and MinoTauro

Performance Analysis of NMMB/BSC-CTM Model

8 Zoom between EBI solvers
The useful function calls between two EBI solvers
The first two dark blue areas are horizontal diffusion calls and the lighter blue area is advection of chemistry

9 Horizontal diffusion
We zoom on the horizontal diffusion and the calls that follow
Horizontal diffusion (blue colour) has load imbalance

Experiences with OmpSs

11 Objectives
Try to apply OmpSs on a real application
Apply an incremental methodology
Identify opportunities
Explore difficulties

12 Horizontal diffusion + communication
The horizontal diffusion has some load imbalance
There is some computation for packing/unpacking the communication buffers (red area)
Gather (green colour) and scatter for the FFTs

13 Horizontal diffusion skeleton code The hdiff subroutine has the following loops and dependencies
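The skeleton itself appears as a figure in the slides and is not part of the transcript. Purely as an illustration of the idea, consecutive loops of a routine like hdiff can be wrapped in OmpSs tasks whose in/out clauses encode the dependencies between them; the loop and array names below are hypothetical.

  ! Hypothetical sketch of a taskified hdiff-like structure.
  ! Loop 1 produces the diffusion coefficients, loop 2 consumes them,
  ! loop 3 updates the field; the clauses give the ordering 1 -> 2 -> 3.
  !$omp task in(t) out(hkx, hky)
  call hdiff_loop1(t, hkx, hky)
  !$omp end task

  !$omp task in(hkx, hky) out(tdiff)
  call hdiff_loop2(hkx, hky, tdiff)
  !$omp end task

  !$omp task in(tdiff) inout(t)
  call hdiff_loop3(tdiff, t)
  !$omp end task

  !$omp taskwait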

14 Local optimizations
Study the hdiff tasks with 2 threads
Loop hdiff3_3 needs 4.7 ms (green colour)
Code of the hdiff3_3 loop:
  do j=jts_b1,jte_h2
    do i=its_b1,ite_h2
      hkfx=hkx(i,j)*fmlx(i,j)
      hkfy=hky(i,j)*fmly(i,j)
      if(num_tracers_chem>0.and.diff_chem)then
        do ks=1,num_tracers_chem
          sx(i,j,ks)=(s(i,j,l,ks)-s(i-1,j,l,ks))*hkfx
          sy(i,j,ks)=(s(i,j,l,ks)-s(i,j-1,l,ks))*hkfy
        enddo
      endif
    enddo
  enddo

15 Local optimizations
Study the hdiff tasks with 2 threads
Loop hdiff3_3 needs 4.7 ms (green colour)
Code of the hdiff3_3 loop:
  do j=jts_b1,jte_h2
    do i=its_b1,ite_h2
      hkfx=hkx(i,j)*fmlx(i,j)
      hkfy=hky(i,j)*fmly(i,j)
      if(num_tracers_chem>0.and.diff_chem)then
        do ks=1,num_tracers_chem
          sx(i,j,ks)=(s(i,j,l,ks)-s(i-1,j,l,ks))*hkfx
          sy(i,j,ks)=(s(i,j,l,ks)-s(i,j-1,l,ks))*hkfy
        enddo
      endif
    enddo
  enddo
New code (the conditional and the tracer loop are hoisted out of the i/j loops, so the branch is no longer evaluated in the innermost iterations):
  if(num_tracers_chem>0.and.diff_chem)then
    do ks=1,num_tracers_chem
      do j=jts_b1,jte_h2
        do i=its_b1,ite_h2
          hkfx=hkx(i,j)*fmlx(i,j)
          hkfy=hky(i,j)*fmly(i,j)
          sx(i,j,ks)=(s(i,j,l,ks)-s(i-1,j,l,ks))*hkfx
          sy(i,j,ks)=(s(i,j,l,ks)-s(i,j-1,l,ks))*hkfy
        enddo
      enddo
    enddo
  endif

16 Local optimizations
Study the hdiff tasks with 2 threads
Loop hdiff3_3 needs 4.7 ms (green colour)
Paraver trace with the code modification
Now hdiff3_3 needs 0.7 ms, a speedup of 6.7 times!

17 Parallelizing loops
Part of hdiff with 2 threads
Parallelizing the most important loops
We get a speedup of 1.3 by using worksharing (an illustrative sketch follows)
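The parallelized loops are only visible in the trace on the slide. The sketch below shows one way such worksharing can be expressed with plain OmpSs tasks, chunking the j loop of hdiff3_3 into blocks of rows; the chunk count and structure are illustrative assumptions, not the actual NMMB code (the diff_chem guard is omitted for brevity, and integer declarations for nb, nchunks, chunk, j1, j2 are assumed in the enclosing routine).

  ! Illustrative sketch: one task per block of j rows; the blocks are
  ! disjoint, so the tasks can run on different threads in parallel,
  ! and the taskwait acts as the barrier at the end of the loop nest.
  nchunks = 4                                       ! assumed block count
  chunk   = (jte_h2 - jts_b1 + nchunks)/nchunks
  do nb = 0, nchunks - 1
    j1 = jts_b1 + nb*chunk
    j2 = min(j1 + chunk - 1, jte_h2)
    !$omp task firstprivate(j1, j2) private(i, j, ks, hkfx, hkfy) &
    !$omp&     shared(s, sx, sy, hkx, hky, fmlx, fmly)
    do ks = 1, num_tracers_chem
      do j = j1, j2
        do i = its_b1, ite_h2
          hkfx = hkx(i,j)*fmlx(i,j)
          hkfy = hky(i,j)*fmly(i,j)
          sx(i,j,ks) = (s(i,j,l,ks) - s(i-1,j,l,ks))*hkfx
          sy(i,j,ks) = (s(i,j,l,ks) - s(i,j-1,l,ks))*hkfy
        enddo
      enddo
    enddo
    !$omp end task
  enddo
  !$omp taskwait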

18 Comparison
The execution of the hdiff subroutine with 1 thread takes 120 ms
The execution of the hdiff subroutine with 2 threads takes 56 ms, a speedup of 2.14
Overall, a 1-hour simulation goes from seconds (average value) to 9.37 seconds, an improvement of 46%

19 Issues related to communication
We study the exch4 subroutine (red colour)
The useful function of exch4 has some computation
The communication creates a pattern and the duration of the MPI_Wait calls can vary

20 Issues related to communication
Big load imbalance because of the message order
There is also some computation

21 Taskify subroutine exch4
We observe the MPI_Wait calls in the first thread
At the same moment the second thread does the necessary computation and overlaps the communication
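The actual taskification of exch4 is not shown in the transcript. The sketch below only illustrates the idea: the blocking completion of the halo exchange and the computation that does not depend on the halo are expressed as separate tasks, so one thread can wait while another computes. The routine and buffer names are hypothetical, and calling MPI inside tasks assumes an appropriate MPI threading level (e.g. MPI_THREAD_MULTIPLE) or serialized MPI calls.

  ! Illustrative sketch of communication/computation overlap with tasks.
  ! Task 1: wait for the pending halo-exchange requests and unpack them.
  !$omp task inout(requests, recv_buf, halo)
  call MPI_Waitall(nreq, requests, MPI_STATUSES_IGNORE, ierr)
  call unpack_halo(recv_buf, halo)          ! hypothetical unpack routine
  !$omp end task

  ! Task 2: interior computation that does not touch the halo points,
  ! so it can run concurrently with task 1 on another thread.
  !$omp task inout(interior)
  call compute_interior(interior)           ! hypothetical routine
  !$omp end task

  ! Both tasks must finish before the halo-dependent computation starts.
  !$omp taskwait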

22 Taskify subroutine exch4
The total execution of the exch4 subroutine with 1 thread
The total execution of the exch4 subroutine with 2 threads
With 2 threads the speedup is 88% (more improvements have been identified)

23 Incremental methodology with OmpSs
Taskify the loops
Start with 1 thread; use if(0) for serializing tasks (see the sketch after this list)
Test that the dependencies are correct (usually trial and error)
Imagine an application crashing after adding 20+ new pragmas (true story)
Do not parallelize loops that do not contain significant computation
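As a reference for the serialization point above, the if clause makes a task undeferred when its expression is false, so the task runs immediately in the creating thread; in Fortran the clause takes a logical expression, so the C-style if(0) from the slide becomes if(.false.). A generic sketch, not NMMB code:

  ! With if(.false.) the task is undeferred: it executes immediately in
  ! the creating thread, which effectively serializes the code while the
  ! in/out annotations stay in place for later testing with more threads.
  !$omp task in(a) out(b) if(.false.)
  call compute_b(a, b)                      ! hypothetical routine
  !$omp end task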

24 Remarks
The incremental methodology is important to keep the overhead added to the application low
OmpSs can be applied to a real application, but it is not straightforward
It can achieve a pretty good speedup, depending on the case
Overlapping communication with computation is a really interesting topic
We are still at the beginning, but OmpSs seems promising

Code vectorization

MUST - MPI runtime error detection

2. Experimental performance analysis tool

28 NEMO – Performance analysis with experimental tool
Study of the poor performance:
Lines of tranxt.f90 (tra_nxt_fix), lines of step.f90 (stp), and lines 196, 209, , 267, 282 of dynzdf_imp.f90 (dyn_zdf_imp)
Credits: Harald Servat

3. Exploring directions

30 Future directions
Collaborate with database experts at BSC
RETHINK big: roadmap for European technologies in hardware and networking for big data
Exascale I/O

Future Work

32 Future improvements
One of the main functions of the application is the EBI solver (run_ebi); a problem with global variables makes the function non-reentrant, so refactoring of the code is needed
Porting more code to OmpSs and investigating MPI calls as tasks
Some computation is independent of the model's layers or of the tracers; OpenCL kernels are going to be developed to test the performance on accelerators
Testing the versioning scheduler
The dynamic load balancing library should be studied further
More info at the ECMWF HPC workshop 2014

33 PRACE schools
PATC Course: Parallel Programming Workshop, October 2014 (BSC)
PATC Course: Earth Sciences Simulation Environments, December 2014 (BSC)
Google for “PATC PRACE”

34 Thank you! Questions?
Acknowledgements: Jesus Labarta, Rosa M. Badia, Judit Gimenez, Roger Ferrer Ibáñez, Victor López, Xavier Teruel, Harald Servat, BSC support