Non blocking communications in RK dynamics

Presentation transcript:

Non blocking communications in RK dynamics: current status and future work. Stefano Zampini, CASPUR/CNMCA. WG-6 PP POMPA @ COSMO GM, Rome, September 6, 2011

Halo exchange in COSMO

- Three types of point-to-point communication: two partially non-blocking and one fully blocking (with MPI_SENDRECV).
- Halo swapping requires completion of the east-west exchange before the south-north exchange can start (implicit corner exchange); see the sketch after this list.
- There is also a choice between explicit buffering and derived MPI datatypes.
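The ordering constraint can be illustrated with a minimal sketch of a blocking two-phase halo swap. This is not the actual COSMO routine: the field layout, a halo width of one, the neighbour ranks and all argument names are assumptions. Because the east-west swap completes first, the following south-north swap of full rows carries the corner values along.

! Minimal sketch of a two-phase halo swap with MPI_SENDRECV (assumed names,
! halo width 1, not the actual COSMO code). Phase 2 sends full rows, halo
! columns included, so corners reach the diagonal neighbours in two hops.
subroutine swap_blocking(field, ie, je, nbw, nbe, nbs, nbn, icomm)
  use mpi
  implicit none
  integer, intent(in)    :: ie, je              ! local sizes incl. 1-point halo
  integer, intent(in)    :: nbw, nbe, nbs, nbn  ! neighbour ranks (or MPI_PROC_NULL)
  integer, intent(in)    :: icomm
  real(8), intent(inout) :: field(ie, je)
  integer :: istat(MPI_STATUS_SIZE), ierr

  ! Phase 1: east-west exchange of the interior boundary columns.
  ! Array sections are safe here only because the calls are blocking;
  ! a non-blocking version needs explicit buffers or derived datatypes.
  call MPI_SENDRECV(field(ie-1,:), je, MPI_DOUBLE_PRECISION, nbe, 1, &
                    field(1,:),    je, MPI_DOUBLE_PRECISION, nbw, 1, &
                    icomm, istat, ierr)
  call MPI_SENDRECV(field(2,:),    je, MPI_DOUBLE_PRECISION, nbw, 2, &
                    field(ie,:),   je, MPI_DOUBLE_PRECISION, nbe, 2, &
                    icomm, istat, ierr)

  ! Phase 2: south-north exchange of full rows (corners travel implicitly)
  call MPI_SENDRECV(field(:,je-1), ie, MPI_DOUBLE_PRECISION, nbn, 3, &
                    field(:,1),    ie, MPI_DOUBLE_PRECISION, nbs, 3, &
                    icomm, istat, ierr)
  call MPI_SENDRECV(field(:,2),    ie, MPI_DOUBLE_PRECISION, nbs, 4, &
                    field(:,je),   ie, MPI_DOUBLE_PRECISION, nbn, 4, &
                    icomm, istat, ierr)
end subroutine swap_blocking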

Details on the non-blocking exchange

- Full halo exchange including corners: twice as many messages, but the same amount of data on the network.
- Three distinct stages: send, receive and wait.
- Minimizing overhead: at the first time step, persistent requests are created with MPI_SEND_INIT and MPI_RECV_INIT. During the model run, MPI_STARTALL starts the requests and MPI_TESTANY/MPI_WAITANY completes them.
- The current implementation uses explicit send and receive buffering only; it still needs to be extended to derived MPI datatypes.
- Strategy used in the RK dynamics (manual implementation), illustrated in the sketch below:
  - Sends are posted as soon as the needed data has been computed locally.
  - Receives are posted as soon as the receive buffer is ready to be reused.
  - Waits are posted just before the data is needed for the next local computation.
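A minimal, self-contained sketch of the persistent-request pattern described on this slide. Buffer and variable names are illustrative, not the actual COSMO ones, and in the real code the send, receive and wait stages are posted at different points of the RK dynamics rather than back to back as here.

! Hedged sketch of persistent requests with MPI_STARTALL / MPI_WAITANY
! (illustrative names, not the COSMO implementation).
module persistent_halo
  use mpi
  implicit none
  integer, parameter :: nreq = 4          ! send+recv for two neighbours (assumed)
  integer            :: ilocalreq(nreq)
  logical            :: lfirst = .true.
contains
  subroutine exchange(sendbuf, recvbuf, ncount, nbleft, nbright, icomm)
    integer, intent(in)    :: ncount, nbleft, nbright, icomm
    real(8), intent(inout) :: sendbuf(ncount,2), recvbuf(ncount,2)
    integer :: i, idx, ierr, istat(MPI_STATUS_SIZE)

    if (lfirst) then
      ! Create the persistent requests once, at the first time step.
      ! They bind to these buffers, so the same buffers must be passed
      ! on every subsequent call.
      call MPI_SEND_INIT(sendbuf(:,1), ncount, MPI_DOUBLE_PRECISION, nbleft,  1, &
                         icomm, ilocalreq(1), ierr)
      call MPI_SEND_INIT(sendbuf(:,2), ncount, MPI_DOUBLE_PRECISION, nbright, 2, &
                         icomm, ilocalreq(2), ierr)
      call MPI_RECV_INIT(recvbuf(:,1), ncount, MPI_DOUBLE_PRECISION, nbright, 1, &
                         icomm, ilocalreq(3), ierr)
      call MPI_RECV_INIT(recvbuf(:,2), ncount, MPI_DOUBLE_PRECISION, nbleft,  2, &
                         icomm, ilocalreq(4), ierr)
      lfirst = .false.
    end if

    ! Restart all persistent requests for this time step ...
    call MPI_STARTALL(nreq, ilocalreq, ierr)

    ! ... and complete them one by one; local work that does not touch
    ! the buffers could be overlapped between STARTALL and the waits.
    do i = 1, nreq
      call MPI_WAITANY(nreq, ilocalreq, idx, istat, ierr)
    end do
  end subroutine exchange
end module persistent_halo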

New synopsis for the swap subroutine

The current code calls subroutine exchg_boundaries; the new subroutine iexchg_boundaries takes four more arguments (see the sketch below):
- ilocalreq(16): array of requests (integers declared as module variables, one set for each swap scenario inside the module)
- operation(3): array of logicals indicating the stages to perform (send, recv, wait)
- istartpar, iendpar: needed for the definition of the corners
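To illustrate the operation(3) argument, here is a hedged, self-contained sketch of how a stage-split routine could dispatch on the requested stages. It is not the real iexchg_boundaries interface, and the assumption that the receive requests occupy the first half of the request array is mine.

! Hedged sketch of a stage-split halo routine (not the real iexchg_boundaries):
! operation = (/ do_send, do_recv, do_wait /), ilocalreq holds persistent
! requests, and receives are assumed to sit in the first half of the array.
subroutine halo_stage(operation, ilocalreq, nreq)
  use mpi
  implicit none
  logical, intent(in)    :: operation(3)
  integer, intent(in)    :: nreq
  integer, intent(inout) :: ilocalreq(nreq)
  integer :: i, idx, ierr, istat(MPI_STATUS_SIZE)

  ! recv stage: post the receives as soon as the buffers may be overwritten
  if (operation(2)) call MPI_STARTALL(nreq/2, ilocalreq(1:nreq/2), ierr)
  ! send stage: post the sends as soon as the data has been computed locally
  if (operation(1)) call MPI_STARTALL(nreq/2, ilocalreq(nreq/2+1:nreq), ierr)
  ! wait stage: complete whatever was started, just before the halo is needed
  if (operation(3)) then
    do i = 1, nreq
      call MPI_WAITANY(nreq, ilocalreq, idx, istat, ierr)
      if (idx == MPI_UNDEFINED) exit   ! no active requests remain
    end do
  end if
end subroutine halo_stage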

New Implementation

Benchmark details

- COSMO RAPS 5.0 with the MeteoSwiss namelist (25 hours of forecast)
- COSMO-2 (520x350x60, dt = 20 s) and COSMO-7 (393x338x60, dt = 60 s)
- Decompositions: tiny (10x12+4), small (20x24+4) and usual (28x35+4)
- Code compiled with Intel ifort 11.1.072 and HP-MPI:
  COMFLG1 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG2 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG3 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG4 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
  LDFLG = -finline-functions -O3
- Runs on the PORDOI Linux cluster at CNMCA: 128 dual-socket quad-core nodes (1024 cores in total)
- Each socket: quad-core Intel Xeon E5450 @ 3.00 GHz with 1 GB RAM per core
- Profiling with Scalasca 1.3.3 (very small overhead)

Early results: COSMO-7

[Charts: total time (s) for the model runs and mean total time for the RK dynamics]

Early results: COSMO-2

[Charts: total time (s) for the model runs and mean total time for the RK dynamics]

Comments and future work

- Almost the same computational times for the test cases considered with the Intel + HP-MPI configuration.
- Not shown: a 5% improvement in computational time with PGI + MVAPICH2 (but with worse absolute times).
- The CFL check is performed only locally when izdebug < 2.
- There is still a lot of synchronization in collective calls during the multiplicative filling of the semi-Lagrangian scheme: Allreduce and Allgather operations in multiple calls to the sum_DDI subroutine (a bottleneck beyond about 1000 cores).
- Poor performance in w_bbc_rk_up5 during the RK loop over small time steps. Rewrite the loop code?
- What about automatic detection and insertion of swapping calls in the microphysics and other parts of the code?
- Is TESTANY/WAITANY the most efficient way to ensure completion? (The simplest alternative is sketched below.)
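On the last question, a minimal sketch of the simplest alternative: completing all outstanding requests with a single MPI_WAITALL instead of a TESTANY/WAITANY loop. The names are illustrative, and whether this is faster in practice depends on how much computation can be overlapped per message.

! Hedged sketch: complete all outstanding (persistent) requests at once
! instead of looping over MPI_TESTANY/MPI_WAITANY. Names are illustrative.
subroutine complete_all(ilocalreq, nreq)
  use mpi
  implicit none
  integer, intent(in)    :: nreq
  integer, intent(inout) :: ilocalreq(nreq)
  integer :: istats(MPI_STATUS_SIZE, nreq), ierr

  ! A single blocking call lets the MPI library progress the requests in any
  ! order, but gives up the per-message overlap a WAITANY loop allows.
  call MPI_WAITALL(nreq, ilocalreq, istats, ierr)
end subroutine complete_all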