Distributed Systems CS 15-440: Programming Models. Gregory Kesden. Borrowed and adapted from our good friends at CMU-Doha, Qatar: Majd F. Sakr, Mohammad Hammoud, and Vinay Kolar.

Objectives. Discussion on Programming Models: why parallelism?, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, Message Passing Interface (MPI), and MapReduce. Current topic: why parallelism?

Amdahl’s Law. We parallelize our programs in order to run them faster. How much faster will a parallel program run? Suppose that the sequential execution of a program takes T_1 time units and the parallel execution on p processors takes T_p time units. Suppose that, out of the entire execution of the program, a fraction s is not parallelizable while the remaining fraction 1 - s is parallelizable. Then the speedup given by Amdahl’s formula is:

    Speedup = T_1 / T_p = 1 / (s + (1 - s) / p)

Amdahl’s Law: An Example. Suppose that 80% of your program can be parallelized and that you use 4 processors to run the parallel version of the program. The speedup you can get according to Amdahl’s law is:

    Speedup = 1 / (0.2 + 0.8 / 4) = 2.5

Although you use 4 processors, you cannot get a speedup of more than 2.5 times; equivalently, the parallel running time is at least 40% of the serial running time.
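
The arithmetic can be checked with a few lines of code. The following C snippet is an illustration added to these notes (it is not part of the original slides):

    #include <stdio.h>

    /* Amdahl's speedup: 1 / (s + (1 - s) / p), where s is the serial fraction. */
    static double amdahl_speedup(double serial_fraction, int processors)
    {
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors);
    }

    int main(void)
    {
        /* 80% parallelizable => serial fraction s = 0.2, with p = 4 processors. */
        printf("Speedup = %.2f\n", amdahl_speedup(0.2, 4));  /* prints 2.50 */
        return 0;
    }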

Ideal Vs. Actual Cases. Amdahl’s argument is too simplified to be applied to real cases. When we run a parallel program, there are, in general, communication overhead and workload imbalance among the processes. [Figure: (1) Parallel speed-up, an ideal case: the parallelizable part is divided evenly among processes 1-4. (2) Parallel speed-up, an actual case: load imbalance and communication overhead lengthen the parallel phase, so the speedup is smaller.]

Guidelines. In order to efficiently benefit from parallelization, we ought to follow these guidelines: (1) maximize the fraction of the program that can be parallelized; (2) balance the workload of the parallel processes; (3) minimize the time spent on communication.

Objectives. Discussion on Programming Models: why parallelism?, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, Message Passing Interface (MPI), and MapReduce. Current topic: parallel computer architectures.

Parallel Computer Architectures. We can categorize the architecture of parallel computers in terms of two aspects: whether the memory is physically centralized or distributed, and whether or not the address space is shared.

    Memory        Shared address space                    Individual address space
    Centralized   UMA: SMP (Symmetric Multiprocessor)     N/A
    Distributed   NUMA (Non-Uniform Memory Access)        MPP (Massively Parallel Processors)

Symmetric Multiprocessor. Symmetric Multiprocessor (SMP) architecture uses shared system resources that can be accessed equally from all processors. A single OS controls the SMP machine, and it schedules processes and threads on processors so that the load is balanced. [Figure: several processors, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O.]

Massively Parallel Processors. Massively Parallel Processors (MPP) architecture consists of nodes, each with its own processor, memory, and I/O subsystem. An independent OS runs at each node. [Figure: multiple nodes, each containing a processor with cache, memory, and an I/O bus, connected by an interconnection network.]

Non-Uniform Memory Access. Non-Uniform Memory Access (NUMA) machines are built on a hardware model similar to MPP. NUMA typically provides a shared address space to applications, using a hardware/software directory-based coherence protocol. The memory latency varies according to whether memory is accessed directly (local) or through the interconnect (remote); hence the name non-uniform memory access. As in an SMP machine, a single OS controls the whole system.

Objectives. Discussion on Programming Models: why parallelism?, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, Message Passing Interface (MPI), and MapReduce. Current topic: traditional models of parallel programming.

Models of Parallel Programming. What is a parallel programming model? A programming model is an abstraction provided by the hardware to programmers. It determines how easily programmers can express their algorithms as parallel units of computation (i.e., tasks) that the hardware understands. It determines how efficiently parallel tasks can be executed on the hardware. Main goal: utilize all the processors of the underlying architecture (e.g., SMP, MPP, NUMA) and minimize the elapsed time of the program.

Traditional Parallel Programming Models. [Figure: parallel programming models split into two families: shared memory and message passing.]

Shared Memory Model. In the shared memory programming model, the abstraction is that parallel tasks can access any location of the memory. Parallel tasks can communicate through reading and writing common memory locations. This is similar to threads of a single process, which share a single address space. Multi-threaded programs (e.g., OpenMP programs) are the best fit for the shared memory programming model.

Shared Memory Model. [Figure: a single-threaded process executes serial section S1, parallel sections P1-P4, and serial section S2 one after another. In the multi-threaded version, the process executes S1, spawns threads that run the parallel sections concurrently in a shared address space, joins them, and then executes S2. S_i = serial, P_j = parallel.]

Shared Memory Example.

Sequential:

    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];
    sum = 0;
    for (i=0; i<8; i++)
        if (a[i] > 0)
            sum = sum + a[i];
    Print sum;

Parallel:

    begin parallel            // spawn a child thread
    private int start_iter, end_iter, i;
    shared int local_iter=4;
    shared double sum=0.0, a[], b[], c[];
    shared lock_type mylock;

    start_iter = getid() * local_iter;
    end_iter = start_iter + local_iter;
    for (i=start_iter; i<end_iter; i++)
        a[i] = b[i] + c[i];
    barrier;

    for (i=start_iter; i<end_iter; i++)
        if (a[i] > 0) {
            lock(mylock);
            sum = sum + a[i];
            unlock(mylock);
        }
    barrier;                  // necessary
    end parallel              // kill the child thread
    Print sum;
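
As a concrete counterpart to the pseudo-code above, the parallel version could be written with OpenMP, one common realization of the shared memory model. The sketch below is illustrative only and is not from the original slides; it assumes an OpenMP-capable compiler (e.g., compile with -fopenmp) and fills b and c with sample data:

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N], b[N], c[N], sum = 0.0;
        int i;

        /* Sample input; in the slide example b and c are assumed to exist already. */
        for (i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }

        /* Threads share a, b, c; each thread gets a chunk of the iteration space. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        /* reduction(+:sum) plays the role of the explicit lock around the shared sum. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            if (a[i] > 0)
                sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }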

Traditional Parallel Programming Models. Recall the two families: shared memory (just discussed) and message passing (next).

Message Passing Model. In message passing, parallel tasks have their own local memories. One task cannot access another task’s memory. Hence, to communicate data, tasks have to rely on explicit messages sent to each other. This is similar to the abstraction of processes, which do not share an address space. MPI programs are the best fit for the message passing programming model.

Message Passing Model. [Figure: a single process executes S1, P1-P4, and S2 sequentially. With message passing, four processes, one per node, each execute the serial sections S1 and S2 and one of the parallel sections P1-P4, with data transmitted over the network between them. S = serial, P = parallel.]

Message Passing Example.

Sequential:

    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];
    sum = 0;
    for (i=0; i<8; i++)
        if (a[i] > 0)
            sum = sum + a[i];
    Print sum;

Parallel (process 0 and process 1):

    id = getpid();
    local_iter = 4;
    start_iter = id * local_iter;
    end_iter = start_iter + local_iter;

    if (id == 0)
        send_msg (P1, b[4..7], c[4..7]);
    else
        recv_msg (P0, b[4..7], c[4..7]);

    for (i=start_iter; i<end_iter; i++)
        a[i] = b[i] + c[i];

    local_sum = 0;
    for (i=start_iter; i<end_iter; i++)
        if (a[i] > 0)
            local_sum = local_sum + a[i];

    if (id == 0) {
        recv_msg (P1, &local_sum1);
        sum = local_sum + local_sum1;
        Print sum;
    }
    else
        send_msg (P0, local_sum);
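
For comparison, here is a hedged sketch of the same two-process computation written against the MPI routines introduced later in this lecture. It is not part of the original slides; it assumes the program is launched with exactly two ranks and that rank 0 owns the input arrays:

    #include <stdio.h>
    #include <mpi.h>

    #define N 8

    int main(int argc, char **argv)
    {
        double a[N], b[N], c[N];
        double local_sum = 0.0, other_sum = 0.0;
        int rank, i, local_iter = N / 2, start, end;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* assume exactly 2 ranks */

        start = rank * local_iter;
        end   = start + local_iter;

        if (rank == 0) {
            /* Rank 0 owns the data; fill b and c with sample values. */
            for (i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }
            /* Ship the second halves of b and c to rank 1. */
            MPI_Send(&b[N/2], N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Send(&c[N/2], N/2, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&b[N/2], N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&c[N/2], N/2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        for (i = start; i < end; i++)
            a[i] = b[i] + c[i];

        for (i = start; i < end; i++)
            if (a[i] > 0)
                local_sum += a[i];

        if (rank == 0) {
            MPI_Recv(&other_sum, 1, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("sum = %f\n", local_sum + other_sum);
        } else {
            MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }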

Shared Memory Vs. Message Passing. Comparison between the shared memory and message passing programming models:

    Aspect               Shared Memory                   Message Passing
    Communication        Implicit (via loads/stores)     Explicit messages
    Synchronization      Explicit                        Implicit (via messages)
    Hardware support     Typically required              None
    Development effort   Lower                           Higher
    Tuning effort        Higher                          Lower

Objectives. Discussion on Programming Models: why parallelism?, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, Message Passing Interface (MPI), and MapReduce. Current topic: examples of parallel processing.

SPMD and MPMD. When we run multiple processes with message passing, there are further categorizations regarding how many different programs are cooperating in the parallel execution. We distinguish between two models: (1) the Single Program Multiple Data (SPMD) model and (2) the Multiple Programs Multiple Data (MPMD) model.

SPMD. In the SPMD model, there is only one program, and each process uses the same executable to work on different sets of data. [Figure: the same executable, a.out, runs on Node 1, Node 2, and Node 3.]

MPMD. The MPMD model uses different programs for different processes, but the processes collaborate to solve the same problem. MPMD has two styles: master/worker and coupled analysis. [Figure: (1) MPMD master/worker: a.out runs on Node 1 as the master, while b.out runs on Nodes 2 and 3. (2) MPMD coupled analysis: a.out, b.out, and c.out run on Nodes 1, 2, and 3, respectively. Example: a.out = structural analysis, b.out = fluid analysis, c.out = thermal analysis.]

3 Key Points  To summarize, keep the following 3 points in mind:  The purpose of parallelization is to reduce the time spent for computation  Ideally, the parallel program is p times faster than the sequential program, where p is the number of processes involved in the parallel execution, but this is not always achievable  Message-passing is the tool to consolidate what parallelization has separated. It should not be regarded as the parallelization itself 27

Objectives. Discussion on Programming Models: why parallelism?, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, Message Passing Interface (MPI), and MapReduce. Current topic: Message Passing Interface (MPI).

Message Passing Interface  In this part, the following concepts of MPI will be described:  Basics  Point-to-point communication  Collective communication 29

What is MPI?  The Message Passing Interface (MPI) is a message passing library standard for writing message passing programs. The goal of MPI is to establish a portable, efficient, and flexible standard for message passing. By itself, MPI is NOT a library, but rather the specification of what such a library should be. MPI is not an IEEE or ISO standard, but it has, in fact, become the industry standard for writing message passing programs on HPC platforms.

Reasons for Using MPI. Standardization: MPI is the only message passing library that can be considered a standard; it is supported on virtually all HPC platforms. Portability: there is no need to modify your source code when you port your application to a different platform that supports the MPI standard. Performance opportunities: vendor implementations should be able to exploit native hardware features to optimize performance. Functionality: over 115 routines are defined. Availability: a variety of implementations are available, both vendor and public domain.

Programming Model  MPI is an example of a message passing programming model  MPI is now used on just about any common parallel architecture including MPP, SMP clusters, workstation clusters and heterogeneous networks  With MPI the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs 32

Communicators and Groups  MPI uses objects called communicators and groups to define which collection of processes may communicate with each other to solve a certain problem  Most MPI routines require you to specify a communicator as an argument  The communicator MPI_COMM_WORLD is often used in calling communication subroutines  MPI_COMM_WORLD is the predefined communicator that includes all of your MPI processes 33

Ranks  Within a communicator, every process has its own unique, integer identifier referred to as rank, assigned by the system when the process initializes  A rank is sometimes called a task ID. Ranks are contiguous and begin at zero  Ranks are used by the programmer to specify the source and destination of messages  Ranks are often also used conditionally by the application to control program execution (e.g., if rank=0 do this / if rank=1 do that) 34
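
As an added illustration (not from the original slides), a minimal MPI program that queries its rank and the size of MPI_COMM_WORLD, and then branches on the rank, might look like this:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start up the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank within the communicator */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0)
            printf("I am the root of %d processes\n", size);
        else
            printf("I am rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }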

Multiple Communicators  It is possible that a problem consists of several sub-problems where each can be solved concurrently  This type of application is typically found in the category of MPMD coupled analysis  We can create a new communicator for each sub-problem as a subset of an existing communicator  MPI allows you to achieve that by using MPI_COMM_SPLIT 35

Example of Multiple Communicators. Consider a problem with a fluid dynamics part and a structural analysis part, where each part can be computed in parallel. [Figure: MPI_COMM_WORLD contains ranks 0-7. World ranks 0-3 form Comm_Fluid, in which they have ranks 0-3; world ranks 4-7 form Comm_Struct, in which they also have ranks 0-3.]
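
One possible way to express this split with MPI_COMM_SPLIT is sketched below. This is an added illustration (not from the original slides); it assumes 8 processes, and the variable and color values are arbitrary labels for the fluid and structural groups:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int world_rank, sub_rank;
        MPI_Comm sub_comm;   /* plays the role of Comm_Fluid or Comm_Struct */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* color 0 = fluid group (world ranks 0-3), color 1 = structural group (4-7). */
        int color = (world_rank < 4) ? 0 : 1;

        /* Processes with the same color end up in the same new communicator;
           the key (world_rank) determines the rank ordering within it. */
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);
        MPI_Comm_rank(sub_comm, &sub_rank);

        printf("world rank %d -> rank %d in group %d\n", world_rank, sub_rank, color);

        MPI_Comm_free(&sub_comm);
        MPI_Finalize();
        return 0;
    }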

Next Class. The discussion on Programming Models (why parallelism?, parallel computer architectures, traditional models of parallel programming, examples of parallel processing, Message Passing Interface (MPI), and MapReduce) continues in Programming Models - Part II.

Message Passing Interface  In this part, the following concepts of MPI will be described:  Basics  Point-to-point communication  Collective communication 38

Point-to-Point Communication. MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task performs a send operation and the other performs a matching receive operation. Ideally, every send operation would be perfectly synchronized with its matching receive. This is rarely the case, so the MPI implementation must be able to deal with storing data when the two tasks are out of sync. [Figure: processor 1 sends A over the network; processor 2 receives A.]

Two Cases. Consider the following two cases: (1) a send operation occurs 5 seconds before the receive is ready; where is the message stored while the receive is pending? (2) Multiple sends arrive at the same receiving task, which can only accept one send at a time; what happens to the messages that are "backing up"?

Steps Involved in Point-to-Point Communication.

On the sending side:
    1. The data is copied into the user buffer by the user.
    2. The user calls one of the MPI send routines.
    3. The system copies the data from the user buffer to the system buffer.
    4. The system sends the data from the system buffer to the destination process.

On the receiving side:
    1. The user calls one of the MPI receive routines.
    2. The system receives the data from the source process and copies it to the system buffer.
    3. The system copies the data from the system buffer to the user buffer.
    4. The user uses the data in the user buffer.

[Figure: on the sender (process 0), sendbuf is copied to sysbuf in kernel mode after the send routine is called, sendbuf can then be reused, and the data is sent from sysbuf to the destination. On the receiver (process 1), the data arrives into sysbuf, is copied into recvbuf after the receive routine is called, and recvbuf then contains valid data.]

Blocking Send and Receive. When we use point-to-point communication routines, we usually distinguish between blocking and non-blocking communication. A blocking send routine will only return after it is safe to modify the application buffer for reuse. Safe means that modifications will not affect the data intended for the receive task. This does not imply that the data was actually received by the receiver; it may be sitting in the system buffer on the sender side. [Figure: once rank 0's sendbuf has been copied out, it is safe to modify, even before its contents reach rank 1's recvbuf over the network.]

Blocking Send and Receive. A blocking send can be: (1) synchronous, meaning there is a handshake with the receive task to confirm a safe send; or (2) asynchronous, meaning the system buffer on the sender side is used to hold the data for eventual delivery to the receiver. A blocking receive only returns after the data has arrived (i.e., it has been stored in the application recvbuf) and is ready for use by the program.

Non-Blocking Send and Receive (1)  Non-blocking send and non-blocking receive routines behave similarly  They return almost immediately  They do not wait for any communication events to complete such as:  Message copying from user buffer to system buffer  Or the actual arrival of a message 44

Non-Blocking Send and Receive (2)  However, it is unsafe to modify the application buffer until you make sure that the requested non-blocking operation was actually performed by the library  If you use the application buffer before the copy completes:  Incorrect data may be copied to the system buffer (in case of non-blocking send)  Or your receive buffer does not contain what you want (in case of non-blocking receive)  You can make sure of the completion of the copy by using MPI_WAIT() after the send or receive operations 45

Why Non-Blocking Communication?  Why do we use non-blocking communication despite its complexity?  Non-blocking communication is generally faster than its corresponding blocking communication  We can overlap computations while the system is copying data back and forth between application and system buffers 46

MPI Point-To-Point Communication Routines.

    Blocking send:        int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
    Non-blocking send:    int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
    Blocking receive:     int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
    Non-blocking receive: int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )
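
As an added illustration of the blocking routines above (not from the original slides), the following sketch assumes exactly two ranks: rank 0 sends an integer to rank 1, which echoes it back:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* assume exactly 2 ranks */

        if (rank == 0) {
            value = 440;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* blocking send to rank 1 */
            MPI_Recv(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status); /* wait for the echo */
            printf("rank 0 got %d back\n", value);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);          /* echo it back */
        }

        MPI_Finalize();
        return 0;
    }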

Message Order  MPI guarantees that messages will not overtake each other  If a sender sends two messages M1 and M2 in succession to the same destination, and both match the same receive, the receive operation will receive M1 before M2  If a receiver posts two receives R1 and R2, in succession, and both are looking for the same message, R1 will receive the message before R2 48

Fairness. MPI does not guarantee fairness; it is up to the programmer to prevent operation starvation. For instance, if task 0 and task 1 send competing messages (i.e., messages that match the same receive) to task 2, only one of the sends will complete. [Figure: task 0 and task 1 both send message A to task 2; which one is matched is not specified.]

Unidirectional Communication. When you send a message from process 0 to process 1, there are four combinations of MPI subroutines to choose from: (1) blocking send and blocking receive; (2) non-blocking send and blocking receive; (3) blocking send and non-blocking receive; (4) non-blocking send and non-blocking receive.

Bidirectional Communication. When two processes exchange data with each other, there are essentially three cases to consider. Case 1: both processes call the send routine first, and then receive. Case 2: both processes call the receive routine first, and then send. Case 3: one process calls the send and receive routines in this order, and the other calls them in the opposite order.

Bidirectional Communication: Deadlocks. With bidirectional communication, we have to be careful about deadlocks. When a deadlock occurs, the processes involved will not proceed any further. Deadlocks can take place either due to an incorrect order of send and receive calls, or due to the limited size of the system buffer.

Case 1. Send First and Then Receive  Consider the following two snippets of pseudo-code:  MPI_ISEND immediately followed by MPI_WAIT is logically equivalent to MPI_SEND 53 IF (myrank==0) THEN CALL MPI_SEND(sendbuf, …) CALL MPI_RECV(recvbuf, …) ELSEIF (myrank==1) THEN CALL MPI_SEND(sendbuf, …) CALL MPI_RECV(recvbuf, …) ENDIF IF (myrank==0) THEN CALL MPI_ISEND(sendbuf, …, ireq, …) CALL MPI_WAIT(ireq, …) CALL MPI_RECV(recvbuf, …) ELSEIF (myrank==1) THEN CALL MPI_ISEND(sendbuf, …, ireq, …) CALL MPI_WAIT(ireq, …) CALL MPI_RECV(recvbuf, …) ENDIF

Case 1. Send First and Then Receive  What happens if the system buffer is larger than the send buffer?  What happens if the system buffer is not larger than the send buffer? Rank 0Rank 1 sendbuf sysbuf sendbuf sysbuf recvbuf Network Rank 0Rank 1 sendbuf sysbuf sendbuf sysbuf recvbuf Network DEADLOCK!

Case 1. Send First and Then Receive  Consider the following pseudo-code:  The code is free from deadlock because:  The program immediately returns from MPI_ISEND and starts receiving data from the other process  In the meantime, data transmission is completed and the calls of MPI_WAIT for the completion of send at both processes do not lead to a deadlock IF (myrank==0) THEN CALL MPI_ISEND(sendbuf, …, ireq, …) CALL MPI_RECV(recvbuf, …) CALL MPI_WAIT(ireq, …) ELSEIF (myrank==1) THEN CALL MPI_ISEND(sendbuf, …, ireq, …) CALL MPI_RECV(recvbuf, …) CALL MPI_WAIT(ireq, …) ENDIF

Case 2. Receive First and Then Send  Would the following pseudo-code lead to a deadlock?  A deadlock will occur regardless of how much system buffer we have  What if we use MPI_ISEND instead of MPI_SEND?  Deadlock still occurs IF (myrank==0) THEN CALL MPI_RECV(recvbuf, …) CALL MPI_SEND(sendbuf, …) ELSEIF (myrank==1) THEN CALL MPI_RECV(recvbuf, …) CALL MPI_ISEND(sendbuf, …) ENDIF

Case 2. Receive First and Then Send  What about the following pseudo-code?  It can be safely executed IF (myrank==0) THEN CALL MPI_IRECV(recvbuf, …, ireq, …) CALL MPI_SEND(sendbuf, …) CALL MPI_WAIT(ireq, …) ELSEIF (myrank==1) THEN CALL MPI_IRECV(recvbuf, …, ireq, …) CALL MPI_SEND(sendbuf, …) CALL MPI_WAIT(ireq, …) ENDIF

Case 3. One Process Sends and Receives; the Other Receives and Sends. What about the following code?

    IF (myrank==0) THEN
        CALL MPI_SEND(sendbuf, …)
        CALL MPI_RECV(recvbuf, …)
    ELSEIF (myrank==1) THEN
        CALL MPI_RECV(recvbuf, …)
        CALL MPI_SEND(sendbuf, …)
    ENDIF

It is always safe to order the calls of MPI_(I)SEND and MPI_(I)RECV at the two processes in an opposite order. In this case, we can use either blocking or non-blocking subroutines.

A Recommendation. Considering the previous options, performance, and the avoidance of deadlocks, it is recommended to use the following code:

    IF (myrank==0) THEN
        CALL MPI_ISEND(sendbuf, …, ireq1, …)
        CALL MPI_IRECV(recvbuf, …, ireq2, …)
    ELSEIF (myrank==1) THEN
        CALL MPI_ISEND(sendbuf, …, ireq1, …)
        CALL MPI_IRECV(recvbuf, …, ireq2, …)
    ENDIF
    CALL MPI_WAIT(ireq1, …)
    CALL MPI_WAIT(ireq2, …)
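
A C rendering of this recommended pattern is sketched below as an added illustration (not from the original slides). It assumes two ranks exchanging one integer each and uses MPI_Waitall in place of the two separate MPI_WAIT calls:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, other, sendval, recvval;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* assume exactly 2 ranks */
        other = 1 - rank;
        sendval = rank * 100;

        /* Post both non-blocking calls first, then wait: no deadlock regardless of
           buffer sizes, and the communication can overlap other work. */
        MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d received %d\n", rank, recvval);

        MPI_Finalize();
        return 0;
    }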

Message Passing Interface  In this part, the following concepts of MPI will be described:  Basics  Point-to-point communication  Collective communication 60

Collective Communication  Collective communication allows you to exchange data among a group of processes  It must involve all processes in the scope of a communicator  The communicator argument in a collective communication routine should specify which processes are involved in the communication  Hence, it is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operation 61

Patterns of Collective Communication. There are several patterns of collective communication: (1) broadcast, (2) scatter, (3) gather, (4) allgather, (5) alltoall, (6) reduce, (7) allreduce, (8) scan, and (9) reduce-scatter.

1. Broadcast. Broadcast sends a message from the process with rank root to all other processes in the group. [Figure: before the broadcast only P0 holds A; afterwards P0-P3 all hold A.]

    int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
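
A small usage sketch of MPI_Bcast, added here for illustration (not from the original slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, a = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            a = 42;                                    /* only the root has the value initially */

        MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* afterwards every rank has a == 42 */
        printf("rank %d sees a = %d\n", rank, a);

        MPI_Finalize();
        return 0;
    }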

2-3. Scatter and Gather. Scatter distributes distinct messages from a single source task to each task in the group. Gather gathers distinct messages from each task in the group to a single destination task. [Figure: Scatter: P0 starts with A, B, C, D, and P0-P3 end up with A, B, C, D respectively; Gather is the reverse.]

    int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm )
    int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm )
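
A combined usage sketch of MPI_Scatter and MPI_Gather, added here for illustration (not from the original slides): the root distributes one integer per process, each process updates its piece, and the root gathers the results back in rank order:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, piece;
        int *data = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* The root prepares one integer per process. */
            data = malloc(size * sizeof(int));
            for (int i = 0; i < size; i++)
                data[i] = i * 10;
        }

        /* Each rank (root included) receives its own element of data. */
        MPI_Scatter(data, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

        piece += 1;  /* some local work on the received piece */

        /* The root gathers the modified pieces back in rank order. */
        MPI_Gather(&piece, 1, MPI_INT, data, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size; i++)
                printf("%d ", data[i]);
            printf("\n");
            free(data);
        }

        MPI_Finalize();
        return 0;
    }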

4. All Gather. Allgather gathers data from all tasks and distributes the combined data to all tasks. Each task in the group, in effect, performs a one-to-all broadcast within the group. [Figure: P0-P3 start with A, B, C, D respectively; after the allgather, every process holds A, B, C, D.]

    int MPI_Allgather ( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm )

5. All To All. With Alltoall, each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index. [Figure: P0-P3 start with rows A0-A3, B0-B3, C0-C3, D0-D3; afterwards P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, and so on (a transpose of the data across processes).]

    int MPI_Alltoall ( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm )

6-7. Reduce and All Reduce. Reduce applies a reduction operation on all tasks in the group and places the result in one task. Allreduce applies a reduction operation and places the result in all tasks in the group; this is equivalent to an MPI_Reduce followed by an MPI_Bcast. [Figure: P0-P3 hold A, B, C, D; Reduce leaves A*B*C*D on P0 only, while Allreduce leaves A*B*C*D on every process (* denotes the reduction operation).]

    int MPI_Reduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm )
    int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
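
A usage sketch of MPI_Reduce and MPI_Allreduce with MPI_SUM as the reduction operation, added here for illustration (not from the original slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, local, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = rank + 1;   /* each rank contributes one value */

        /* Sum all contributions onto rank 0 only. */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of 1..%d = %d\n", size, total);

        /* Allreduce: every rank gets the sum (equivalent to Reduce followed by Bcast). */
        MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d also sees the sum %d\n", rank, total);

        MPI_Finalize();
        return 0;
    }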

8. Scan. Scan computes the scan (partial reductions) of data on a collection of processes. [Figure: P0-P3 hold A, B, C, D; after the scan, P0 holds A, P1 holds A*B, P2 holds A*B*C, and P3 holds A*B*C*D.]

    int MPI_Scan ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
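
A usage sketch of MPI_Scan computing an inclusive prefix sum, added here for illustration (not from the original slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, local, prefix;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = rank + 1;

        /* Inclusive prefix sum: rank i receives local_0 + ... + local_i. */
        MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: prefix sum = %d\n", rank, prefix);

        MPI_Finalize();
        return 0;
    }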

9. Reduce Scatter. Reduce Scatter combines values and scatters the results; it is equivalent to an MPI_Reduce followed by an MPI_Scatter operation. [Figure: P0-P3 each hold a row of four values (A0-A3, B0-B3, C0-C3, D0-D3); afterwards P_i holds the reduction of the i-th column, e.g., P0 holds A0*B0*C0*D0.]

    int MPI_Reduce_scatter ( void *sendbuf, void *recvbuf, int *recvcnts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

Considerations and Restrictions  Collective operations are blocking  Collective communication routines do not take message tag arguments  Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then attaching the new groups to new communicators 70