Mixed MPI/OpenMP programming on HPCx
Mark Bull, EPCC
with thanks to Jake Duthie and Lorna Smith

2 Motivation
HPCx, like many current high-end systems, is a cluster of SMP nodes.
Such systems can generally be adequately programmed using just MPI.
It is also possible to write mixed MPI/OpenMP code
– seems to be the best match of programming model to hardware
– what are the (dis)advantages of using this model?

3 Implementation

4 Styles of mixed code
Master only
– all communication is done by the OpenMP master thread, outside of parallel regions
– other threads are idle during communication
Funnelled
– all communication is done by one OpenMP thread, but this may occur inside parallel regions
– other threads may be computing during communication
Multiple
– several threads (maybe all) communicate
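The slides do not show code for these styles; as a minimal Fortran sketch (mine, not the author's), the master-only and funnelled styles, and the MPI thread-support level each requires, might look like the following. The array work, the neighbour pattern and the computation are placeholders.

program mixed_styles
  use mpi
  implicit none
  integer :: ierr, provided, rank, nranks, partner, i
  integer :: status(MPI_STATUS_SIZE)
  double precision :: work(1000), halo

  ! Master-only and funnelled need MPI_THREAD_FUNNELED; the multiple
  ! style would request MPI_THREAD_MULTIPLE instead.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  partner = mod(rank + 1, nranks)
  work = 1.0d0

  ! Master-only style: all MPI calls sit outside parallel regions,
  ! so the other threads are idle while this exchange takes place.
  !$omp parallel do
  do i = 1, 1000
     work(i) = 2.0d0 * work(i)
  end do
  !$omp end parallel do
  call MPI_Sendrecv(work(1000), 1, MPI_DOUBLE_PRECISION, partner, 0, &
                    halo,       1, MPI_DOUBLE_PRECISION, partner, 0, &
                    MPI_COMM_WORLD, status, ierr)

  ! Funnelled style: one thread (here the master) communicates inside
  ! a parallel region; the other threads could be given independent work.
  !$omp parallel
  !$omp master
  call MPI_Sendrecv(work(1), 1, MPI_DOUBLE_PRECISION, partner, 1, &
                    halo,    1, MPI_DOUBLE_PRECISION, partner, 1, &
                    MPI_COMM_WORLD, status, ierr)
  !$omp end master
  !$omp end parallel

  call MPI_Finalize(ierr)
end program mixed_styles

In the multiple style several threads call MPI concurrently, which also means using tags or separate communicators so that messages from different threads are not confused.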

5 Issues
We need to consider:
– Development / maintenance costs
– Portability
– Performance

6 Development / maintenance
In most cases, development and maintenance will be harder than for an MPI code, and much harder than for an OpenMP code.
If MPI code already exists, addition of OpenMP may not be too much overhead
– easier if parallelism is nested
– can use master-only style, but this may not deliver performance: see later
In some cases, it may be possible to use a simpler MPI implementation because the need for scalability is reduced
– e.g. 1-D domain decomposition instead of 2-D

7 Portability
Both OpenMP and MPI are themselves highly portable (but not perfect).
Combined MPI/OpenMP is less so
– main issue is thread safety of MPI
– if full thread safety is assumed (multiple style), portability will be reduced
– batch environments have varying amounts of support for mixed-mode codes
Desirable to make sure the code functions correctly (with conditional compilation) as stand-alone MPI code and as stand-alone OpenMP code.
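A sketch of the kind of conditional compilation the last point suggests: USE_MPI is a hypothetical user-defined preprocessor macro (the name is mine), while _OPENMP and the !$ sentinel are defined by the OpenMP standard. Built with both, one, or neither of the two, the same source acts as mixed, pure-MPI, pure-OpenMP or serial code.

! Requires preprocessing (e.g. a .F90 file); USE_MPI is an assumed
! user-defined macro, _OPENMP is set automatically by OpenMP compilers.
program portable_skeleton
#ifdef USE_MPI
  use mpi
#endif
  !$ use omp_lib
  implicit none
  integer :: ierr, provided, rank
  rank = 0

#ifdef USE_MPI
#ifdef _OPENMP
  ! mixed mode: request only the thread support level actually needed
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  if (provided < MPI_THREAD_FUNNELED) stop 'insufficient MPI thread support'
#else
  call MPI_Init(ierr)
#endif
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
#endif

  ! OpenMP conditional-compilation sentinel: ignored by non-OpenMP compilers
  !$ if (rank == 0) print *, 'threads per process:', omp_get_max_threads()

  ! ... computation common to all build variants ...

#ifdef USE_MPI
  call MPI_Finalize(ierr)
#endif
end program portable_skeleton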

8 Performance
Possible reasons for using mixed MPI/OpenMP codes on HPCx:
1. Intra-node MPI overheads
2. Contention for the network
3. Poor MPI implementation
4. Poorly scaling MPI codes
5. Replicated data

9 Intra-node MPI overheads
Simple argument:
– Use of OpenMP within a node avoids overheads associated with calling the MPI library.
– Therefore a mixed OpenMP/MPI implementation will outperform a pure MPI version.

10 Intra-node MPI overheads
Complicating factors:
– The OpenMP implementation may introduce additional overheads not present in the MPI code, e.g. synchronisation, false sharing, sequential sections.
– The mixed implementation may require more synchronisation than a pure OpenMP version, especially if non-thread-safety of MPI is assumed.
– Implicit point-to-point synchronisation may be replaced by (more expensive) OpenMP barriers.
– In the pure MPI code, the intra-node messages will often be naturally overlapped with inter-node messages; it is harder to overlap inter-thread communication with inter-node messages.

11 Example (master-only style; "....." marks arguments elided on the slide)

!$omp parallel do
DO I=1,N
   A(I) = B(I) + C(I)
END DO
CALL MPI_BSEND(A(N),1,.....)
CALL MPI_RECV(A(0),1,.....)
DO I = 1,N
   D(I) = A(I-1) + A(I)
END DO

Annotations from the slide:
– an implicit OpenMP barrier is added at the end of the parallel loop
– other threads idle while master does MPI calls
– a single MPI task may not use all network bandwidth
– cache miss to access message data (A(N) in the send)
– cache miss to access received data (A(0) in the second loop)
– in the pure MPI version, intra-node messages would be overlapped with inter-node messages

12 Mixed mode styles again...
Master-only style of mixed coding introduces significant overheads
– often outweighs the benefits
Can use funnelled or multiple styles to overcome this
– typically much harder to develop and maintain
– load balancing compute/communicate threads in funnelled style
– mapping both processes and threads to a topology in multiple style

13 Competition for network
On a node with p processors, we will often have the situation where all p MPI processes (in a pure MPI code) send a message off node at the same time.
This may cause contention for network ports (or other hardware resources).
It may be better to send a single message which is p times the length.
On the other hand, a single MPI task may not be able to utilise all the network bandwidth.

14 On the Phase 1 machine (Colony switch), off-node bandwidth is (almost) independent of the number of MPI tasks
– the PCI adapter is the bottleneck
– this may change on Phase 2 (Federation switch)
Test: ping-pong between 2 nodes
– vary the number of task pairs from 1 to 8
– measure aggregate bandwidth for large messages (8 Mbytes)
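A rough sketch of this kind of test (not the actual benchmark used): each task on one node ping-pongs 8 Mbyte messages with a partner on the other node, and the per-pair bandwidths are summed to give the aggregate. The repetition count and the assumption that the first half of the ranks sit on one node and the second half on the other are mine.

program pingpong
  use mpi
  implicit none
  integer, parameter :: nbytes = 8*1024*1024        ! 8 Mbyte messages
  integer, parameter :: n = nbytes/8, nreps = 50    ! nreps is illustrative
  integer :: ierr, rank, nranks, partner, rep
  integer :: status(MPI_STATUS_SIZE)
  double precision, allocatable :: buf(:)
  double precision :: t0, t1, bw

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  allocate(buf(n)); buf = 1.0d0

  ! assumes an even number of tasks, the first half on node A and the
  ! second half on node B, so every pair crosses the switch
  partner = mod(rank + nranks/2, nranks)

  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do rep = 1, nreps
     if (rank < nranks/2) then
        call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, partner, 0, MPI_COMM_WORLD, ierr)
        call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, partner, 0, MPI_COMM_WORLD, status, ierr)
     else
        call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, partner, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, partner, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_Wtime()

  ! bandwidth seen by one pair; summing over the 1 to 8 pairs gives
  ! the aggregate figure plotted on the next slide
  bw = 2.0d0 * nreps * nbytes / (t1 - t0) / 1.0d6
  if (rank == 0) print *, 'per-pair bandwidth (MB/s):', bw

  call MPI_Finalize(ierr)
end program pingpong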

15 Aggregate bandwidth

16 Poor MPI implementation
If the MPI implementation is not cluster-aware, then a mixed-mode code may have some advantages.
A good implementation of collective communications should minimise inter-node messages
– e.g. do the reduction within nodes, then across nodes
A mixed-mode code would achieve this naturally
– e.g. OpenMP reduction within the node, MPI reduction across nodes
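A minimal sketch of that two-stage reduction in the mixed code; the routine name, the data and the one-process-per-node assumption are mine.

! OpenMP reduces across the threads of a node (one MPI process per
! node assumed), then a single MPI_Allreduce per node combines the
! node-local partial sums across the machine.
subroutine mixed_sum(x, n, global_sum)
  use mpi
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: x(n)
  double precision, intent(out) :: global_sum
  double precision :: local_sum
  integer :: i, ierr

  local_sum = 0.0d0
  !$omp parallel do reduction(+:local_sum)
  do i = 1, n
     local_sum = local_sum + x(i)
  end do
  !$omp end parallel do

  call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
end subroutine mixed_sum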

17 Optimising collectives
MPI on the Phase 1 system is not cluster-aware.
Can improve the performance of some collectives by up to a factor of 2 by using shared memory within a node.
Most of this performance gain can be attained in pure MPI by using split communicators
– a subset of optimised collectives is available
– contact Stephen Booth ( ) if interested
MPI on the Phase 2 system should be cluster-aware.
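This is not the optimised library mentioned on the slide; it is only an illustration of the split-communicator idea. The assumptions that ranks are numbered consecutively within a node and that tasks_per_node is known are mine.

! Two-stage allreduce using split communicators: reduce to a node
! leader within each node, allreduce among the leaders, then broadcast
! back within each node.
subroutine cluster_aware_allreduce(local, global, tasks_per_node)
  use mpi
  implicit none
  double precision, intent(in)  :: local
  double precision, intent(out) :: global
  integer, intent(in) :: tasks_per_node
  integer :: ierr, rank, node_comm, leader_comm
  double precision :: node_sum

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! intra-node communicator: same colour for all ranks on one node
  call MPI_Comm_split(MPI_COMM_WORLD, rank/tasks_per_node, rank, node_comm, ierr)
  ! inter-node communicator: only the first task of each node takes part
  call MPI_Comm_split(MPI_COMM_WORLD, &
                      merge(0, MPI_UNDEFINED, mod(rank, tasks_per_node) == 0), &
                      rank, leader_comm, ierr)

  call MPI_Reduce(local, node_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, node_comm, ierr)
  if (leader_comm /= MPI_COMM_NULL) then
     call MPI_Allreduce(node_sum, global, 1, MPI_DOUBLE_PRECISION, MPI_SUM, leader_comm, ierr)
  end if
  call MPI_Bcast(global, 1, MPI_DOUBLE_PRECISION, 0, node_comm, ierr)

  call MPI_Comm_free(node_comm, ierr)
  if (leader_comm /= MPI_COMM_NULL) call MPI_Comm_free(leader_comm, ierr)
end subroutine cluster_aware_allreduce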

18 Poorly scaling MPI codes
If the MPI version of the code scales poorly, then a mixed MPI/OpenMP version may scale better.
May be true in cases where OpenMP scales better than MPI due to:
1. Algorithmic reasons
– e.g. adaptive/irregular problems where load balancing in MPI is difficult
2. Simplicity reasons
– e.g. 1-D domain decomposition
3. Reduction in communication
– often only occurs if the dimensionality of the communication pattern is reduced

19 Most likely to be successful on fat node clusters (few MPI processes)
May be more attractive on Phase 2 system
– 32 processors per node instead of 8

20 Replicated data
Some MPI codes use a replicated data strategy
– all processes have a copy of a major data structure
A pure MPI code needs one copy per process(or).
A mixed code would only require one copy per node
– the data structure can be shared by multiple threads within a process
Can use mixed code to increase the amount of memory available per task.
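A sketch of the idea (table size, contents and names are placeholders): the replicated structure is allocated once per MPI process and, with one process per node, is shared read-only by all the OpenMP threads on that node.

program replicated_data
  use mpi
  implicit none
  integer, parameter :: tbl_size = 10000000        ! ~80 MB of doubles, illustrative
  double precision, allocatable :: table(:)        ! one copy, shared by all threads
  double precision :: total
  integer :: ierr, provided, rank, i

  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! one allocation per MPI process: with 1 process and 8 threads per
  ! node this is one copy per node, rather than 8 copies with pure MPI
  allocate(table(tbl_size))
  table = 1.0d0

  total = 0.0d0
  ! all threads read the same shared table; no per-thread copies needed
  !$omp parallel do reduction(+:total)
  do i = 1, tbl_size
     total = total + table(i)
  end do
  !$omp end parallel do

  if (rank == 0) print *, 'checksum =', total
  deallocate(table)
  call MPI_Finalize(ierr)
end program replicated_data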

21 On HPCx, the amount of memory available to a task is ~7.5 GB divided by the number of tasks per node
– for large-memory jobs, can request fewer than 8 tasks per node
– since charging is by node usage, this is rather expensive
– mixed-mode codes can make some use of the spare processors, even if they are not particularly efficient

22 Some case studies
Simple Jacobi kernel
– 2-D domain, halo exchanges and global reduction
ASCI Purple benchmarks
– UMT2K: photon transport on a 3-D unstructured mesh
– sPPM: gas dynamics on a 3-D regular domain

23 Simple Jacobi kernel
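Purely as an illustration of the structure listed on slide 22 (2-D domain, halo exchanges and a global reduction), a master-only mixed-mode sweep might look like the following sketch; the 1-D process decomposition, argument names and error measure are my choices, not the benchmark code.

! One iteration: MPI halo exchange (master thread, outside any parallel
! region), threaded stencil update, then a global error reduction.
! The domain is split across processes in the j direction only.
subroutine jacobi_sweep(u, unew, n, jlo, jhi, comm, global_err)
  use mpi
  implicit none
  integer, intent(in) :: n, jlo, jhi, comm
  double precision, intent(inout) :: u(0:n+1, jlo-1:jhi+1)
  double precision, intent(out)   :: unew(0:n+1, jlo-1:jhi+1)
  double precision, intent(out)   :: global_err
  integer :: i, j, ierr, rank, nranks, up, down
  integer :: status(MPI_STATUS_SIZE)
  double precision :: local_err

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nranks, ierr)
  down = rank - 1; if (down < 0)      down = MPI_PROC_NULL
  up   = rank + 1; if (up >= nranks)  up   = MPI_PROC_NULL

  ! halo exchange of the two boundary columns
  call MPI_Sendrecv(u(0,jhi),   n+2, MPI_DOUBLE_PRECISION, up,   0, &
                    u(0,jlo-1), n+2, MPI_DOUBLE_PRECISION, down, 0, &
                    comm, status, ierr)
  call MPI_Sendrecv(u(0,jlo),   n+2, MPI_DOUBLE_PRECISION, down, 1, &
                    u(0,jhi+1), n+2, MPI_DOUBLE_PRECISION, up,   1, &
                    comm, status, ierr)

  ! threaded update of the local block, with a thread-level reduction
  local_err = 0.0d0
  !$omp parallel do private(i) reduction(max:local_err)
  do j = jlo, jhi
     do i = 1, n
        unew(i,j) = 0.25d0 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        local_err = max(local_err, abs(unew(i,j) - u(i,j)))
     end do
  end do
  !$omp end parallel do

  ! process-level reduction completes the global error norm
  call MPI_Allreduce(local_err, global_err, 1, MPI_DOUBLE_PRECISION, &
                     MPI_MAX, comm, ierr)
end subroutine jacobi_sweep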

24 Results are for 32 processors
Mixed mode is slightly faster than MPI.
Collective communication cost is reduced with more threads.
Point-to-point communication costs increase with more threads
– extra cache misses observed
Choice of process+thread geometry is crucial
– it affects computation time significantly

25 UMT2K
Nested parallelism
– OpenMP parallelism at lower level than MPI
Master-only style
– implemented with a single OpenMP parallel for directive
Mixed mode is consistently slower
– OpenMP doesn't scale well

26 UMT2K performance

27 sPPM
Parallelism is essentially at one level
– MPI decomposition and OpenMP parallel loops both over the physical domain
Funnelled style
– overlapped communication and calculation with dynamic load balancing
– NB: not always suitable as it can destroy data locality
Mixed mode is significantly faster
– main gains appear to be reduction in inter-node communication
– in some places, avoids MPI communication in one of the three dimensions
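A sketch (not the sPPM source; the names and placeholder computation are mine) of the funnelled overlap described above: the master thread funnels the MPI traffic while the other threads update cells that do not depend on the incoming halo, and SCHEDULE(DYNAMIC) provides the load balancing between them. MPI must have been initialised with at least MPI_THREAD_FUNNELED.

subroutine overlapped_step(interior, n, halo_out, halo_in, nh, partner, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, nh, partner, comm
  double precision, intent(inout) :: interior(n)
  double precision, intent(in)    :: halo_out(nh)
  double precision, intent(out)   :: halo_in(nh)
  integer :: i, ierr
  integer :: status(MPI_STATUS_SIZE)

  !$omp parallel private(i)

  ! one thread (the master) funnels all MPI traffic ...
  !$omp master
  call MPI_Sendrecv(halo_out, nh, MPI_DOUBLE_PRECISION, partner, 0, &
                    halo_in,  nh, MPI_DOUBLE_PRECISION, partner, 0, &
                    comm, status, ierr)
  !$omp end master

  ! ... while the other threads update interior cells that do not need
  ! the incoming halo. There is no barrier after END MASTER, and the
  ! dynamic schedule lets the master pick up iterations once it is free.
  !$omp do schedule(dynamic,64)
  do i = 1, n
     interior(i) = 0.5d0 * (interior(i) + 1.0d0)    ! placeholder computation
  end do
  !$omp end do
  ! implicit barrier here: both the communication and the interior
  ! update are complete before halo-dependent cells are touched

  !$omp end parallel
end subroutine overlapped_step

The slide's warning about data locality applies to this pattern: with a dynamic schedule, a given cell may be updated by a different thread on each step.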

29 What should you do?
Don't rush: you need to argue a very good case for a mixed-mode implementation.
If MPI scales better than OpenMP within a node, you are unlikely to see a benefit
– requirement for large memory is an exception
The simple "master-only" style is unlikely to work.
It may be better to consider making your algorithm/MPI implementation cluster-aware (e.g. use nested communicators, match inner dimension to a node, ....)

30 Conclusions
Clearly not the most effective mechanism for all codes.
Significant benefit may be obtained in certain situations:
– poor scaling with MPI processes
– replicated data codes
Unlikely to benefit well-optimised existing MPI codes.
Portability and development / maintenance considerations must also be weighed.