Early experiences on COSMO hybrid parallelization at CASPUR/USAM
Stefano Zampini, Ph.D. (CASPUR)
COSMO-POMPA Kick-off Meeting, 3-4 May 2011, Manno (CH)


Outline
– MPI and multithreading: comm/comp overlap
– Benchmark on multithreaded halo swapping
– Non-blocking collectives (NBC library)
– Scalasca profiling of loop-level hybrid parallelization of leapfrog dynamics

Multithreaded MPI
MPI 2.0 and multithreading: a different initialization, MPI_INIT_THREAD(required, provided)
– MPI_THREAD_SINGLE: no support for multithreading (equivalent to MPI_INIT)
– MPI_THREAD_FUNNELED: only the master thread may call MPI
– MPI_THREAD_SERIALIZED: any thread may call MPI, but only one at a time
– MPI_THREAD_MULTIPLE: multiple threads may call MPI simultaneously (the library serializes them internally as needed)
Cases of interest here: FUNNELED and MULTIPLE
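Not from the slides: a minimal Fortran sketch of this initialization, checking the support level the library actually grants (program name and error handling are illustrative):

program init_thread_check
  use mpi
  implicit none
  integer :: required, provided, ierr

  ! Ask for FUNNELED support: only the thread that called MPI_INIT_THREAD
  ! (the master) will make MPI calls later on.
  required = MPI_THREAD_FUNNELED
  call MPI_INIT_THREAD(required, provided, ierr)

  ! The library may grant less than requested; abort if it does.
  if (provided < required) then
     print *, 'insufficient thread support: ', provided
     call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
  end if

  ! ... hybrid MPI/OpenMP work goes here ...

  call MPI_FINALIZE(ierr)
end program init_thread_check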

Masteronly (serial comm.)

double precision dataSR(ie,je,ke)            ! (iloc=ie-2*ovl, jloc=je-2*ovl)
double precision dataARP, dataARS
call MPI_INIT_THREAD(MPI_THREAD_FUNNELED,…)
…
call MPI_CART_CREATE(…….,comm2d,…)
…
!$omp parallel shared(dataSR,dataARS,dataARP)
…
!$omp master
  ! initialize derived datatypes for SENDRECV as for pure MPI
!$omp end master
…
!$omp master
  call SENDRECV(dataSR(..,..,1),1,……,comm2d)
  call MPI_ALLREDUCE(dataARP,dataARS,…,comm2d)
!$omp end master
!$omp barrier
! do some COMPUTATION (involving dataSR, dataARS and dataARP)
!$omp end parallel
…
call MPI_FINALIZE(…)

Masteronly (with overlap)

double precision dataSR(ie,je,ke)            ! (iloc=ie-2*ovl, jloc=je-2*ovl)
double precision dataARP, dataARS
call MPI_INIT_THREAD(MPI_THREAD_FUNNELED,…)
…
call MPI_CART_CREATE(…….,comm2d,…)
…
!$omp parallel shared(dataSR,dataARS,dataARP)
…
!$omp master        ! can be generalized to a subteam with !$omp if(lcomm.eq.true.)
  ! initialize derived datatypes for SENDRECV as for pure MPI
!$omp end master
…
!$omp master        ! other threads skip the master construct and begin computing
  call SENDRECV(dataSR(..,..,1),1,……,comm2d)
  call MPI_ALLREDUCE(dataARP,dataARS,…,comm2d)
!$omp end master
! do some COMPUTATION (not involving dataSR, dataARS and dataARP)
…
!$omp end parallel
…
call MPI_FINALIZE(…)

A multithreaded approach
A simple alternative: use !$OMP SECTIONS without modifying the COSMO code (the number of communicating threads can be controlled); a sketch follows below.
Requires explicit coding for loop-level parallelization to be cache-friendly.
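A minimal, self-contained sketch (not COSMO code) of the sections idea: one section swaps a halo slab with the neighbouring ranks while another section updates interior points that do not depend on the halo. The routine name, the simple 1-D decomposition, and the placeholder computation are illustrative only; at least MPI_THREAD_SERIALIZED is required, since the communicating thread need not be the master.

subroutine sections_overlap(field, ie, je, ke, left, right, comm)
  use mpi
  implicit none
  integer, intent(in) :: ie, je, ke, left, right, comm
  double precision, intent(inout) :: field(ie, je, ke)
  integer :: ierr, k
  integer :: status(MPI_STATUS_SIZE)

!$omp parallel private(ierr, k, status)
!$omp sections
!$omp section
  ! Communication section: one thread swaps the i-direction halo columns.
  call MPI_SENDRECV(field(ie-1, :, :), je*ke, MPI_DOUBLE_PRECISION, right, 0, &
                    field(1, :, :),    je*ke, MPI_DOUBLE_PRECISION, left,  0, &
                    comm, status, ierr)
!$omp section
  ! Computation section: meanwhile, update interior points that do not
  ! depend on the halo (a placeholder scaling stands in for the stencil).
  do k = 1, ke
     field(3:ie-2, 2:je-1, k) = 0.25d0 * field(3:ie-2, 2:je-1, k)
  end do
!$omp end sections
!$omp end parallel
end subroutine sections_overlap

With more sections, several threads can communicate concurrently, which is how the number of communicating threads can be controlled without touching the loop-level code.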

A multithreaded approach (cont’d)

double precision dataSR(ie,je,ke)            ! (iloc=ie-2*ovl, jloc=je-2*ovl)
double precision dataARP, dataARS(nth)
call MPI_INIT_THREAD(MPI_THREAD_MULTIPLE,…)
…
call MPI_CART_CREATE(…….,comm2d,…)
! perform some domain decomposition
…
!$omp parallel private(dataARP,iam,startk,endk,totk), shared(dataSR,dataARS)
…
iam = OMP_GET_THREAD_NUM()
nth = OMP_GET_NUM_THREADS()
startk = ke*iam/nth + 1        ! distribute the halo wall
endk   = ke*(iam+1)/nth        ! among a team of nth threads
totk   = endk - startk
…
! initialize datatypes for SENDRECV (each thread owns its own 1/nth part of the wall)
! duplicate the MPI communicator for threads with the same thread number and different MPI rank
…
call SENDRECV(dataSR(..,..,startk),1,……,mycomm2d(iam+1),…)          ! all threads call MPI
call MPI_ALLREDUCE(dataARP,dataARS(iam+1),….,mycomm2d(iam+1),….)    ! each group of threads can perform a different collective operation
…
!$omp end parallel
…
call MPI_FINALIZE(…)

Experimental setting
– Benchmark runs on the Linux cluster MATRIX (CASPUR): 256 dual-socket quad-core AMD nodes at 2.1 GHz (Intel compiler, OpenMPI 1.4.1)
– Linux cluster PORDOI (CNMCA): 128 dual-socket quad-core Intel Xeon nodes at 3.00 GHz (Intel compiler, HP-MPI)
– Only one call to the sendrecv operation is timed (to avoid meaningless results)
– MPI process-to-core binding can be controlled explicitly
– Thread-to-core binding cannot be controlled explicitly on the AMD nodes; with Intel CPUs it is possible either at compile time (-par-affinity option) or at run time (KMP_AFFINITY environment variable). Early results on hybrid loop-level parallelization of leapfrog dynamics in COSMO suggest this is not a major issue.

Experimental results (MATRIX)
Configurations tested: 8 threads per MPI process (1 MPI process per node), 4 threads per MPI process (1 MPI process per socket), 2 threads per MPI process (2 MPI processes per socket).
– Comparison of average times for point-to-point communications: MPI_THREAD_MULTIPLE shows overhead for large numbers of processes.
– Computational times are comparable up to (at least) 512 cores.
– In the test case considered, message sizes were always below the eager limit; MPI_THREAD_MULTIPLE would be beneficial if the per-thread message size determined the switch between eager and rendezvous protocols. (A sketch of a ping-pong probe for the eager limit follows below.)
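As context for the eager-limit remark, this minimal ping-pong probe (not part of the talk; program name, sizes, and repetition count are illustrative) times round trips for growing message sizes; the eager-to-rendezvous switch of an MPI implementation typically shows up as a jump in the measured latency:

program pingpong_eager_probe
  use mpi
  implicit none
  integer, parameter :: nrep = 1000
  integer :: ierr, rank, nproc, sz, i
  integer :: status(MPI_STATUS_SIZE)
  double precision, allocatable :: buf(:)
  double precision :: t0, t1

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  if (nproc < 2) call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)

  sz = 1
  do while (sz <= 65536)                      ! message sizes from 8 B to 512 KiB
     allocate(buf(sz)); buf = 0.d0
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
     t0 = MPI_WTIME()
     do i = 1, nrep
        if (rank == 0) then
           call MPI_SEND(buf, sz, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
           call MPI_RECV(buf, sz, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, status, ierr)
        else if (rank == 1) then
           call MPI_RECV(buf, sz, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
           call MPI_SEND(buf, sz, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
        end if
     end do
     t1 = MPI_WTIME()
     ! print message size in bytes and one-way latency in microseconds
     if (rank == 0) print '(i10, f12.3)', 8*sz, 1.d6*(t1 - t0)/(2*nrep)
     deallocate(buf)
     sz = 2*sz
  end do
  call MPI_FINALIZE(ierr)
end program pingpong_eager_probe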

Experimental results (PORDOI)
Configurations tested: 8 threads per MPI process (1 MPI process per node), 4 threads per MPI process (1 MPI process per socket), 2 threads per MPI process (2 MPI processes per socket).
– Comparison of average times for point-to-point communications: MPI_THREAD_MULTIPLE shows overhead for large numbers of processes.
– Computational times are comparable up to (at least) 1024 cores.

LibNBC
– Collective communications are implemented in the LibNBC library as a schedule of rounds, each round consisting of non-blocking point-to-point communications
– Progress in the background can be advanced using functions similar to MPI_Test (local operations)
– Completion of the collective operation is obtained through NBC_WAIT (non-local)
– The Fortran binding is not completely bug-free
– For additional details:

Possible strategies (with leapfrog)
– One ALLREDUCE operation per time step on the variable zuvwmax(0:ke) (index 0 for the CFL check, 1:ke needed for satad)
– For the Runge-Kutta module only the variable zwmax(1:ke) (satad)

COSMO code:
  call global_values(zuvwmax(0:ke),….)
  …
  CFL check with zuvwmax(0)
  …
  many computations (incl. fast_waves)
  do k=1,ke
    if(zuvwmax(k)….) kitpro=…
    call satad(…,kitpro,..)
  enddo

COSMO-NBC (no overlap):
  handle=0
  call NBC_IALLREDUCE(zuvwmax(0:ke),…,handle,…)
  call NBC_TEST(handle,ierr)
  if(ierr.ne.0) call NBC_WAIT(handle,ierr)
  …
  CFL check with zuvwmax(0)
  …
  many computations (incl. fast_waves)
  do k=1,ke
    if(zuvwmax(k)….) kitpro=…
    call satad(…,kitpro,..)
  enddo

COSMO-NBC (with overlap):
  call global_values(zuvwmax(0),…)
  handle=0
  call NBC_IALLREDUCE(zuvwmax(1:ke),…,handle,…)
  …
  CFL check with zuvwmax(0)
  …
  many computations (incl. fast_waves)
  call NBC_TEST(handle,ierr)
  if(ierr.ne.0) call NBC_WAIT(handle,ierr)
  do k=1,ke
    if(zuvwmax(k)….) kitpro=…
    call satad(…,kitpro,..)
  enddo

Early results
Preliminary results with 504 computing cores, 20x25 PEs, global grid 641x401x40, … hours of forecast simulation.
Results from MATRIX with COSMO yutimings (similar for PORDOI):

                         avg comm dyn (s)   total time (s)
  COSMO                        63,31            611,23
  COSMO-NBC (no ovl)           69,10            652,45
  COSMO-NBC (with ovl)         54,94            611,55

– No benefit without changing the source code (synchronization in MPI_Allreduce on zuvwmax(0))
– We must identify the right place in COSMO to issue the NBC calls (post the operation first, then test)

Calling tree for leapfrog dynamics
– Preliminary parallelization has been performed by inserting OpenMP directives without modifying the preexisting COSMO code.
– Orphaned directives cannot be used in the subroutines because of the large number of automatic variables.
– Master-only approach for MPI communications.
(Figure: calling tree, with the parallelized subroutines highlighted.)

Loop-level parallelization
– The leapfrog dynamical core in the COSMO code contains many grid sweeps over the local part of the domain (possibly including halo dofs)
– Most of the grid sweeps can be parallelized on the outermost (vertical) loop, for example:

  !$omp do
  do k=1,ke
    do j=1,je
      do i=1,ie
        ! perform something…
      enddo
    enddo
  enddo
  !$omp end do

– Most of the grid sweeps are well load-balanced
– About 350 OpenMP directives were inserted into the code
– Hard tuning not yet performed (waiting for the results of other tasks); a fuller sketch follows below.
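A minimal compilable sketch (not COSMO code; the i/j/k bounds follow the slides' convention and the stencil is a placeholder) of such a k-parallel sweep, with the data-sharing clauses spelled out:

subroutine sweep(u, unew, ie, je, ke)
  implicit none
  integer, intent(in) :: ie, je, ke
  double precision, intent(in)  :: u(ie, je, ke)
  double precision, intent(out) :: unew(ie, je, ke)
  integer :: i, j, k

  ! Parallelize on the outermost (vertical) index; each thread gets whole
  ! horizontal levels, which keeps the sweep cache-friendly.
!$omp parallel do private(i, j, k)
  do k = 1, ke
     do j = 2, je - 1
        do i = 2, ie - 1
           ! placeholder stencil standing in for the real COSMO operator
           unew(i, j, k) = 0.25d0 * (u(i-1, j, k) + u(i+1, j, k) &
                                   + u(i, j-1, k) + u(i, j+1, k))
        end do
     end do
  end do
!$omp end parallel do
end subroutine sweep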

Preliminary results from MATRIX
1 hour of forecast, global grid 641x401x40, fixed number of computing cores (192).

Domain decompositions:
  Pure MPI             12x16
  Hybrid, 2 threads     8x12
  Hybrid, 4 threads     6x8
  Hybrid, 8 threads     4x6

(Figure: plot of the ratios hybrid time / MPI time for the dynamical core, from COSMO YUTIMING.)

Preliminary results from PORDOI
The Scalasca toolset was used to profile the hybrid version of COSMO:
– Easy to use
– Produces a fully browsable calling tree with different metrics
– Automatically instruments each OpenMP construct using OPARI
– User-friendly GUI (cube3)
For details, see
Computational domain considered: 749x401x40. Results shown for a test case of 2 hours of forecast and a fixed number of computing cores (400 = MPI x OpenMP). Issues of internode scalability for the fast waves are highlighted.

Scalasca profiling

Observations
– Timing results with 2 or 4 threads are comparable with the pure MPI case; hybrid speedup with 8 threads is poor.
– The same conclusions hold with a fixed number of MPI processes and a growing number of threads and cores (poor internode scalability).
– The COSMO code needs to be modified to eliminate CPU time wasted at explicit barriers (use more threads to communicate?).
– Use dynamic scheduling and explicit control of a shared variable (see next slide) as a possible overlap strategy that avoids implicit barriers.
– OpenMP overhead (thread wake-up, loop scheduling) is negligible.
– Try to avoid !$omp workshare: write out each array-syntax statement explicitly.
– Keep the number of !$omp single constructs as low as possible.
– Segmentation faults were experienced when using collapse clauses. Why?

Thread subteams are not yet part of the OpenMP 3.0 standard. How can we avoid wasting CPU time while the master thread is communicating, without major modifications to the code and with minor overhead?

chunk = ke/(numthreads-1) + 1
!$ checkvar(1:ke) = .false.
!$omp parallel private(checkflag,k)
………….
!$omp master
  ! Exchange halo for variables needed later in the code (not AA)
!$omp end master
!$ checkflag = .false.
!$omp do schedule(dynamic,chunk)
do k=1,ke
  ! Perform something writing into variable AA
  !$ checkvar(k) = .true.
enddo
!$omp end do nowait
!$ do while (.not.checkflag)          ! each thread spins separately until the previous loop has finished
!$   checkflag = .true.
!$   do k=1,ke
!$     checkflag = checkflag .and. checkvar(k)
!$   end do
!$ end do
!$omp do schedule(dynamic,chunk)      ! now we can enter this loop safely
do k=1,ke
  ! Perform something else reading from variable AA
enddo
!$omp end do nowait
………….
!$omp end parallel

Thank you for your attention!