1 Friday, October 13, 2006 The biggest difference between time and space is that you can't reuse time. -M. Furst

2 machinefile is a text file (also called boot schema file) containing the following:

hpcc.lums.edu.pk
compute-0-0.local
compute-0-1.local
compute-0-2.local
compute-0-3.local
compute-0-4.local
compute-0-5.local
compute-0-6.local

3 hpcc.lums.edu.pk
compute-0-0.local
compute-0-1.local
compute-0-2.local
compute-0-3.local
compute-0-4.local
compute-0-5.local
compute-0-6.local

§lamboot -v machinefile
 l Launches LAM runtime environment
§mpirun -np 4 hello
 l Launches 4 copies of hello
§Scheduling of copies is implementation dependent.
 l LAM will schedule in a round-robin fashion on every node depending on the number of CPUs listed per node.

4 #include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

(The slide shows this same program running as one copy on each of hpcc.lums.edu.pk, compute-0-1.local, compute-0-0.local and compute-0-2.local.)

5 #include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char name[100];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);
    printf("Rank:%d Name:%s\n", rank, name);
    MPI_Finalize();
    return 0;
}

6 mpirun -np 4 pname
Rank:0 Name:hpcc.lums.edu.pk
Rank:2 Name:compute-0-1.local
Rank:1 Name:compute-0-0.local
Rank:3 Name:compute-0-2.local

7 mpirun -np 4 pname
Rank:0 Name:hpcc.lums.edu.pk
Rank:2 Name:compute-0-1.local
Rank:1 Name:compute-0-0.local
Rank:3 Name:compute-0-2.local

Processes on remote nodes have their stdout redirected to that of mpirun.

8 mpirun -np 8 pname
Rank:0 Name:hpcc.lums.edu.pk
Rank:2 Name:compute-0-1.local
Rank:1 Name:compute-0-0.local
Rank:3 Name:compute-0-2.local
Rank:4 Name:compute-0-3.local
Rank:5 Name:compute-0-4.local
Rank:6 Name:compute-0-5.local
Rank:7 Name:compute-0-6.local

9 mpirun -np 16 pname
Rank:0 Name:hpcc.lums.edu.pk
Rank:8 Name:hpcc.lums.edu.pk
Rank:1 Name:compute-0-0.local
Rank:3 Name:compute-0-2.local
Rank:11 Name:compute-0-2.local
Rank:7 Name:compute-0-6.local
Rank:4 Name:compute-0-3.local
Rank:2 Name:compute-0-1.local
Rank:5 Name:compute-0-4.local
Rank:6 Name:compute-0-5.local
Rank:9 Name:compute-0-0.local
Rank:15 Name:compute-0-6.local
Rank:12 Name:compute-0-3.local
Rank:10 Name:compute-0-1.local
Rank:13 Name:compute-0-4.local
Rank:14 Name:compute-0-5.local

10 Suppose boot schema file contains:

hpcc.lums.edu.pk cpu=2
compute-0-0.local cpu=2
compute-0-1.local cpu=2
compute-0-2.local cpu=2
compute-0-3.local cpu=2
compute-0-4.local cpu=2
compute-0-5.local cpu=2
compute-0-6.local cpu=2

11 mpirun -np 8 pname
Rank:0 Name:hpcc.lums.edu.pk
Rank:1 Name:hpcc.lums.edu.pk
Rank:4 Name:compute-0-1.local
Rank:2 Name:compute-0-0.local
Rank:6 Name:compute-0-2.local
Rank:3 Name:compute-0-0.local
Rank:7 Name:compute-0-2.local
Rank:5 Name:compute-0-1.local

12 mpirun -np 16 pname
Rank:0 Name:hpcc.lums.edu.pk
Rank:1 Name:hpcc.lums.edu.pk
Rank:8 Name:compute-0-3.local
Rank:2 Name:compute-0-0.local
Rank:6 Name:compute-0-2.local
Rank:10 Name:compute-0-4.local
Rank:14 Name:compute-0-6.local
Rank:4 Name:compute-0-1.local
Rank:12 Name:compute-0-5.local
Rank:3 Name:compute-0-0.local
Rank:7 Name:compute-0-2.local
Rank:9 Name:compute-0-3.local
Rank:13 Name:compute-0-5.local
Rank:11 Name:compute-0-4.local
Rank:15 Name:compute-0-6.local
Rank:5 Name:compute-0-1.local

13 §mpirun C hello
 l Launch one copy of hello on every CPU that was listed in the boot schema
§mpirun N hello
 l Launch one copy of hello on every node in the LAM universe (disregards CPU count)

14 #include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank % 2 == 0) {
        printf("Rank:%d, I am EVEN\n", rank);
    } else {
        printf("Rank:%d, I am ODD\n", rank);
    }
    MPI_Finalize();
    return 0;
}

15 mpirun -np 8 rpdt
Rank:0, I am EVEN
Rank:2, I am EVEN
Rank:1, I am ODD
Rank:5, I am ODD
Rank:3, I am ODD
Rank:7, I am ODD
Rank:6, I am EVEN
Rank:4, I am EVEN

16 Point-to-point communication

MPI_Send (void* buf, int count, MPI_Datatype datatype, int dest,
          int tag, MPI_Comm comm)

MPI_Recv (void* buf, int count, MPI_Datatype datatype, int source,
          int tag, MPI_Comm comm, MPI_Status *status)

17 #include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size, source=0, dest=1, tag=12;
    float sent=23.65, recv;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        MPI_Send(&sent, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
        printf("I am %d of %d Sent %f\n", rank, size, sent);
    } else {
        MPI_Recv(&recv, 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
        printf("I am %d of %d Received %f\n", rank, size, recv);
    }
    MPI_Finalize();
    return 0;
}

18 lamboot -v mf2
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1 ssi:boot:base:linear: booting n0 (hpcc.lums.edu.pk)
n-1 ssi:boot:base:linear: booting n1 (compute-0-0.local)
n-1 ssi:boot:base:linear: finished

mpirun -np 2 sendrecv
I am 0 of 2 Sent 23.650000
I am 1 of 2 Received 23.650000

mf2 is a text file containing the following:
hpcc.lums.edu.pk
compute-0-0.local

19 lamboot -v mf2
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1 ssi:boot:base:linear: booting n0 (hpcc.lums.edu.pk)
n-1 ssi:boot:base:linear: booting n1 (compute-0-0.local)
n-1 ssi:boot:base:linear: finished

mpirun -np 2 sendrecv
I am 0 of 2 Sent 23.650000
I am 1 of 2 Received 23.650000

mf2 is a text file containing the following:
hpcc.lums.edu.pk
compute-0-0.local

What will happen if I use np > 2?
What will happen if I use np = 1?

20 §MPI_Recv is a blocking receive operation §MPI allows two different implementations for MPI_Send : buffered and un-buffered. §MPI programs must be able to run correctly regardless of which of the two methods is used for implementing MPI_Send. §Such programs are called safe.

21 int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

Note: count entries of datatype

22 Avoiding Deadlocks

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking non-buffered, there is a deadlock.

23 int main(int argc, char *argv[]) {
    int rank, size, source=0, dest=1;
    float sent[5] = {10, 20, 30, 40, 50};
    float recv;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

24  if (rank == 0) {
        MPI_Send(&sent[0], 1, MPI_FLOAT, dest, 12, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[0]);
        MPI_Send(&sent[1], 1, MPI_FLOAT, dest, 13, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[1]);
        MPI_Send(&sent[2], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[2]);
    } else {
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 12, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 13, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
    }
    MPI_Finalize();
    return 0;
}

25 Rank:0 Sent 10.000000
Rank:0 Sent 20.000000
Rank:0 Sent 30.000000
Rank:1 Received 10.000000
Rank:1 Received 20.000000
Rank:1 Received 30.000000

26  if (rank == 0) {
        MPI_Send(&sent[0], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[0]);
        MPI_Send(&sent[1], 1, MPI_FLOAT, dest, 13, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[1]);
        MPI_Send(&sent[2], 1, MPI_FLOAT, dest, 12, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[2]);
    } else {
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 12, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 13, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
    }

NOTE: Unsafe: depends on whether system buffering is provided or not

27 Rank:0 Sent 10.000000
Rank:0 Sent 20.000000
Rank:0 Sent 30.000000
Rank:1 Received 30.000000
Rank:1 Received 20.000000
Rank:1 Received 10.000000

28 NOTE: Unsafe: depends on whether system buffering is provided or not

    if (rank == 0) {
        MPI_Send(&sent[0], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[0]);
        MPI_Send(&sent[1], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[1]);
        MPI_Send(&sent[2], 1, MPI_FLOAT, dest, 13, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[2]);
        MPI_Send(&sent[3], 1, MPI_FLOAT, dest, 12, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[3]);
    } else {
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 12, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 13, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
    }

29 Rank:0 Sent 10.000000
Rank:0 Sent 20.000000
Rank:0 Sent 30.000000
Rank:0 Sent 40.000000
Rank:1 Received 40.000000
Rank:1 Received 30.000000
Rank:1 Received 10.000000
Rank:1 Received 20.000000

30 Sending and Receiving Messages §MPI allows specification of wildcard arguments for both source and tag.  If source is set to MPI_ANY_SOURCE, then any process of the communication domain can be the source of the message.  If tag is set to MPI_ANY_TAG, then messages with any tag are accepted. §On the receive side, the message must be of length equal to or less than the length field specified.
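
As an illustration of the wildcard arguments (a sketch, not part of the original slides), rank 0 below accepts one message from every other rank in whatever order they arrive and then reads the actual sender and tag out of the MPI_Status object. The payload value rank*100 and the tag choice are arbitrary and only for the example.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        int value, i;
        /* Accept the messages in arrival order, from any sender, with any tag */
        for (i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d from rank %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        int value = rank * 100;   /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}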

31 Example §Numerical Integration

32 Numerical Integration (Serial)

#include <stdio.h>
main() {
    float integral, a, b, h, x;
    int n, i;
    float f(float x);   /* Function we're integrating */

    printf("Enter a, b, and n\n");
    scanf("%f %f %d", &a, &b, &n);
    h = (b-a)/n;
    integral = (f(a) + f(b))/2.0;
    x = a;
    for (i = 1; i <= n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    printf("With n = %d trapezoids, our estimate\n", n);
    printf("of the integral from %f to %f = %f\n", a, b, integral);
} /* main */

33 Numerical Integration (Parallel)

main(int argc, char** argv) {
    int   my_rank;    /* My process rank           */
    int   p;          /* The number of processes   */
    float a = 0.0;    /* Left endpoint             */
    float b = 1.0;    /* Right endpoint            */
    int   n = 1024;   /* Number of trapezoids      */
    float h;          /* Trapezoid base length     */
    float local_a;    /* Left endpoint my process  */
    float local_b;    /* Right endpoint my process */
    int   local_n;    /* Number of trapezoids for  */
                      /* my calculation            */
    float integral;   /* Integral over my interval */
    float total;      /* Total integral            */
    int   source;     /* Process sending integral  */
    int   dest = 0;   /* All messages go to 0      */
    int   tag = 0;
    MPI_Status status;

34 Numerical Integration (Parallel)

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    h = (b-a)/n;      /* h is the same for all processes */
    local_n = n/p;    /* So is the number of trapezoids  */
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

    /* Add up the integrals calculated by each process */
    if (my_rank == 0) {
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);
            total = total + integral;
        }

35 Numerical Integration (Parallel)

    } else {
        MPI_Send(&integral, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
    }

    /* Print the result */
    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n", a, b, total);
    }

    /* Shut down MPI */
    MPI_Finalize();
} /* main */

36 Numerical Integration (Parallel)

float Trap(float local_a, float local_b, int local_n, float h) {
    float integral;   /* Store result in integral */
    float x;
    int   i;
    float f(float x); /* function */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}

37 Avoiding Deadlocks

Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...

38 Avoiding Deadlocks

Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...

Once again, we have a deadlock if MPI_Send is blocking.

39 Avoiding Deadlocks

We can break the circular wait to avoid deadlocks as follows:

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1) {
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
} else {
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
}
...

40 Sending and Receiving Messages Simultaneously

To exchange messages, MPI provides the following function:

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)
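
As a sketch (not on the original slides), the ring exchange from the previous deadlock example can be written with a single MPI_Sendrecv call, letting the library pair the send and receive internally instead of relying on the manual odd/even ordering:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int a[10] = {0}, b[10], npes, myrank;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    /* Send a to the right neighbour and receive b from the left neighbour
       in one call; no deadlock regardless of how MPI_Send is implemented. */
    MPI_Sendrecv(a, 10, MPI_INT, (myrank+1)%npes, 1,
                 b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
                 MPI_COMM_WORLD, &status);
    printf("Rank %d completed the exchange\n", myrank);
    MPI_Finalize();
    return 0;
}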

41 All-to-All broadcast in Hypercube

42 All-to-All broadcast in Hypercube (figure: after the first exchange, neighbouring node pairs hold the combined data 0,1 / 2,3 / 4,5 / 6,7)

43 All-to-All broadcast in Hypercube (figure: after the second exchange, each node holds either 0,1,2,3 or 4,5,6,7)

44 All-to-All broadcast in Hypercube (figure: after the third exchange, every node holds 0,1,2,3,4,5,6,7)

46 Possibility of deadlock if implemented as shown and system buffering not provided.

47 #include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

#define MAXMSG 100
#define SINGLEMSG 10

int main(int argc, char *argv[]) {
    int i, j, rank, size, bytes_read, d=3, nbytes=SINGLEMSG, partner, tag=11;
    char *result, *received;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    received = (char *) malloc(MAXMSG);   /* must hold the largest combined message */
    result = (char *) malloc(MAXMSG);
    if (argc != (size+1)) {
        perror("Command line arguments missing");
        MPI_Finalize();
        exit(1);
    }
    strcpy(result, argv[rank+1]);
    for (i=0; i<d; i++) {
        partner = rank ^ (1<<i);
        MPI_Sendrecv(result, strlen(result)+1, MPI_CHAR, partner, tag,
                     received, MAXMSG, MPI_CHAR, partner, tag,
                     MPI_COMM_WORLD, &status);
        printf("I am node %d: Sent %s\t Received %s\n", rank, result, received);
        strcat(result, received);
    }
    printf("I am node %d: My final result is %s\n", rank, result);
    MPI_Finalize();
    return 0;
}

48 int main(int argc, char *argv[]) {
    // initializations
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    received = (char *) malloc(MAXMSG);   /* must hold the largest combined message */
    result = (char *) malloc(MAXMSG);
    // error checks
    strcpy(result, argv[rank+1]);
    for (i=0; i<d; i++) {
        partner = rank ^ (1<<i);
        MPI_Sendrecv(result, strlen(result)+1, MPI_CHAR, partner, tag,
                     received, MAXMSG, MPI_CHAR, partner, tag,
                     MPI_COMM_WORLD, &status);
        printf("I am node %d: Sent %s\t Received %s\n", rank, result, received);
        strcat(result, received);
    }
    printf("I am node %d: My final result is %s\n", rank, result);
    MPI_Finalize();
    return 0;
}

49 mpirun -np 8 hbroadcast "one " "two " "three " "four " "five " "six " "seven " "eight "
I am node 0: Sent one  Received two
I am node 4: Sent five  Received six
I am node 5: Sent six  Received five
I am node 1: Sent two  Received one
I am node 3: Sent four  Received three
I am node 2: Sent three  Received four
I am node 0: Sent one two  Received three four
I am node 3: Sent four three  Received two one
I am node 7: Sent eight  Received seven
I am node 1: Sent two one  Received four three
I am node 2: Sent three four  Received one two
I am node 7: Sent eight seven  Received six five
I am node 6: Sent seven  Received eight

50 I am node 0: Sent one two three four  Received five six seven eight
I am node 0: My final result is one two three four five six seven eight
I am node 5: Sent six five  Received eight seven
I am node 6: Sent seven eight  Received five six
I am node 7: Sent eight seven six five  Received four three two one
I am node 3: Sent four three two one  Received eight seven six five
I am node 5: Sent six five eight seven  Received two one four three
I am node 4: Sent five six  Received seven eight
I am node 3: My final result is four three two one eight seven six five
I am node 1: Sent two one four three  Received six five eight seven
I am node 1: My final result is two one four three six five eight seven
I am node 5: My final result is six five eight seven two one four three
I am node 6: Sent seven eight five six  Received three four one two
I am node 4: Sent five six seven eight  Received one two three four
I am node 4: My final result is five six seven eight one two three four
I am node 2: Sent three four one two  Received seven eight five six
I am node 7: My final result is eight seven six five four three two one
I am node 2: My final result is three four one two seven eight five six
I am node 6: My final result is seven eight five six three four one two

51 mpirun -np 8 hbroadcast "one " "two " "three " "four " "five " "six " "seven " "eight " | grep "node 0 "
I am node 0: Sent one  Received two
I am node 0: Sent one two  Received three four
I am node 0: Sent one two three four  Received five six seven eight
I am node 0: My final result is one two three four five six seven eight

52 mpirun -np 8 hbroadcast "one " "two " "three " "four " "five " "six " "seven " "eight " | grep "node 7 "
I am node 7: Sent eight  Received seven
I am node 7: Sent eight seven  Received six five
I am node 7: Sent eight seven six five  Received four three two one
I am node 7: My final result is eight seven six five four three two one

53 MPI_Bsend Buffered blocking send §Permits the programmer to allocate the required amount of buffer space into which data can be copied until it is delivered. §Insulates against the problems associated with insufficient system buffer space. §Routine returns after the data has been copied from application buffer space to the allocated send buffer. §Must be used with the MPI_Buffer_attach routine.
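
A minimal sketch of how MPI_Bsend is typically paired with MPI_Buffer_attach and MPI_Buffer_detach (this code is not from the original slides; the buffer size is computed with MPI_Pack_size plus MPI_BSEND_OVERHEAD, as the standard requires):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, bufsize, packsize;
    float data = 3.14f, recv;
    char *buffer;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Work out how much buffer space one buffered send of 1 float needs */
    MPI_Pack_size(1, MPI_FLOAT, MPI_COMM_WORLD, &packsize);
    bufsize = packsize + MPI_BSEND_OVERHEAD;
    buffer = (char *) malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);       /* hand the buffer to MPI */
    if (rank == 0) {
        MPI_Bsend(&data, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); /* returns once copied */
    } else if (rank == 1) {
        MPI_Recv(&recv, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %f\n", recv);
    }
    MPI_Buffer_detach(&buffer, &bufsize);     /* blocks until buffered sends complete */
    free(buffer);
    MPI_Finalize();
    return 0;
}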

54 MPI_Ssend Synchronous blocking send §Send a message and block until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.

55 MPI_Rsend Blocking ready send. §Should only be used if the programmer is certain that the matching receive has already been posted.

56 Non-blocking:
 l Non-blocking send and receive routines behave similarly - they will return almost immediately.
 l They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message.
 l It is unsafe to modify the application buffer until you know for a fact that the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this.
 l Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.

57 (figure: blocking non-buffered vs. non-blocking non-buffered send/receive protocols)

58 §In the case of a non-blocking buffered send, the call does not wait for the message to be copied from user memory (the application buffer) to system buffer space. §The programmer must check when it is safe to touch the application buffer again.

59 Non-blocking: MPI_Isend §Identifies an area in memory to serve as a send buffer. §Processing continues immediately without waiting for the message to be copied out from the application buffer.

60 Non-blocking: MPI_Isend §A communication request handle is returned for handling the pending message status.  The program should not modify the application buffer until subsequent calls to MPI_Wait or MPI_Test indicate that the non-blocking send has completed. MPI_Isend (&buf,count,datatype,dest,tag,comm, &request)

61 Non-blocking: MPI_Irecv §Identifies an area in memory to serve as a receive buffer. §Processing continues immediately without actually waiting for the message to be received and copied into the application buffer.

62 Non-blocking: MPI_Irecv §A communication request handle is returned for handling the pending message status. The program must use calls to MPI_Wait or MPI_Test to determine when the non-blocking receive operation completes and the requested message is available in the application buffer. MPI_Irecv (&buf,count,datatype,source,tag,comm, &request)

63 MPI_Wait blocks until a specified non-blocking send or receive operation has completed.

MPI_Wait (&request, &status)
MPI_Waitany (count, &array_of_requests, &index, &status)
MPI_Waitall (count, &array_of_requests, &array_of_statuses)
MPI_Waitsome (incount, &array_of_requests, &outcount, &array_of_offsets, &array_of_statuses)

64 MPI_Test checks the status of a specified non-blocking send or receive operation.
 l The "flag" parameter is returned logical true (1) if the operation has completed, and logical false (0) if not.

MPI_Test (&request, &flag, &status)
MPI_Testany (count, &array_of_requests, &index, &flag, &status)
MPI_Testall (count, &array_of_requests, &flag, &array_of_statuses)
MPI_Testsome (incount, &array_of_requests, &outcount, &array_of_offsets, &array_of_statuses)
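
A sketch of the usual MPI_Test pattern (not from the original slides): rank 1 posts a non-blocking receive, then keeps computing and polling until the flag turns true. The helper do_some_work() is hypothetical and just stands in for useful local computation.

#include <stdio.h>
#include "mpi.h"

/* Hypothetical helper standing in for useful local work. */
static void do_some_work(void) { /* ... */ }

int main(int argc, char *argv[]) {
    int rank, size, flag = 0, token = 0;
    MPI_Request request;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }
    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* Poll with MPI_Test and keep computing until the message has arrived */
        while (!flag) {
            do_some_work();
            MPI_Test(&request, &flag, &status);
        }
        printf("Rank 1 received %d while doing other work\n", token);
    }
    MPI_Finalize();
    return 0;
}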

65 #include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
    MPI_Request reqs[4];
    MPI_Status stats[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    prev = rank-1;
    next = rank+1;
    if (rank == 0) prev = numtasks - 1;
    if (rank == (numtasks - 1)) next = 0;

    MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

    // { do some work }

    MPI_Waitall(4, reqs, stats);
    MPI_Finalize();
}

66 MPI_Issend
§Non-blocking synchronous send. Similar to MPI_Isend(), except MPI_Wait() or MPI_Test() indicates when the destination process has received the message.

MPI_Ibsend
§Non-blocking buffered send. Similar to MPI_Bsend() except MPI_Wait() or MPI_Test() indicates when the destination process has received the message. Must be used with the MPI_Buffer_attach routine.

MPI_Irsend
§Non-blocking ready send. Similar to MPI_Rsend() except MPI_Wait() or MPI_Test() indicates when the destination process has received the message. Should only be used if the programmer is certain that the matching receive has already been posted.

67 Collective communication §Collective communication must involve all processes in the scope of a communicator. §It is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operations. §All processes in the communicator must specify the same source or target processes.

68 Collective communication MPI_Barrier §Creates a barrier synchronization in a group. §Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. §MPI_Barrier (comm)

69 Collective communication MPI_Bcast §Broadcasts a message from one process to all other processes in the group.  MPI_Bcast (&buffer,count,datatype,source,comm) count and datatype must match on all processes
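
A small sketch of a typical MPI_Bcast use (not from the slides): rank 0 knows a parameter and every other rank obtains the same value. The value 1024 is an arbitrary example.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 1024;   /* e.g. a problem size known only at the root */
    /* Every process calls MPI_Bcast with the same count, datatype and root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d has n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}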

70 Collective communication MPI_Scatter  Distributes distinct messages from a single source task to each task in the group. MPI_Scatter (&sendbuf,sendcnt,sendtype,&recvbuf, recvcnt,recvtype,source,comm) Process i receives sendcnt contiguous elements starting from i*sendcnt location of sendbuf. sendcnt is the number of elements sent to each individual process.

71 Collective communication MPI_Gather  Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter. MPI_Gather (&sendbuf,sendcnt,sendtype,&recvbuf, recvcount,recvtype,root,comm) Data is stored in recvbuf in a rank order. Data from process i is stored in recvbuf at location i*sendcnt. Information about receive buffer is applicable to the recipient process and is ignored for all others.
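
An illustrative sketch (assumed, not from the slides): each process contributes one value and rank 0 gathers them in rank order. The bound MAXP is made up so a fixed-size receive buffer can be used.

#include <stdio.h>
#include "mpi.h"

#define MAXP 64   /* assumed upper bound on the number of processes */

int main(int argc, char *argv[]) {
    int rank, size, i;
    int mine, all[MAXP];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size > MAXP) MPI_Abort(MPI_COMM_WORLD, 1);
    mine = rank * rank;   /* arbitrary per-process value */
    /* recvbuf and recvcount are significant only on the root (rank 0) */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("Value from rank %d: %d\n", i, all[i]);
    MPI_Finalize();
    return 0;
}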

72 Collective communication MPI_Allgather l Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.

73 Collective communication MPI_Reduce l Applies a reduction operation on all tasks in the group and places the result in one task.
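
For instance (a sketch, not part of the slides), a per-process partial value can be summed onto rank 0 with a single call; here the partial value is just the rank, standing in for a locally computed result:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank;
    float partial, total = 0.0f;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partial = (float) rank;   /* stand-in for a locally computed value */
    /* Sum every process's partial value into total on rank 0 */
    MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Sum of all partial values = %f\n", total);
    MPI_Finalize();
    return 0;
}

In the trapezoid program above, the explicit MPI_Recv loop on rank 0 and the MPI_Send on the other ranks could be replaced by one such MPI_Reduce call with MPI_SUM.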

75 Collective communication MPI_Allreduce §Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.

76 Collective communication MPI_Alltoall §Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.
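
A minimal sketch (not from the slides): with p processes, each process sends one int to every other process and receives one int from each, so both buffers hold p elements in rank order. The bound MAXP is assumed for illustration.

#include <stdio.h>
#include "mpi.h"

#define MAXP 64   /* assumed upper bound on the number of processes */

int main(int argc, char *argv[]) {
    int rank, size, i;
    int sendbuf[MAXP], recvbuf[MAXP];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size > MAXP) MPI_Abort(MPI_COMM_WORLD, 1);
    for (i = 0; i < size; i++)
        sendbuf[i] = rank * 100 + i;       /* block i is destined for rank i */
    /* After the call, recvbuf[i] holds the block that rank i addressed to us */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    for (i = 0; i < size; i++)
        printf("Rank %d got %d from rank %d\n", rank, recvbuf[i], i);
    MPI_Finalize();
    return 0;
}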

77 #include <stdio.h>
#include "mpi.h"

#define SIZE 4

int main(int argc, char *argv[]) {
    int numtasks, rank, sendcount, recvcount, source;
    float sendbuf[SIZE][SIZE] = {
        {1.0, 2.0, 3.0, 4.0},
        {5.0, 6.0, 7.0, 8.0},
        {9.0, 10.0, 11.0, 12.0},
        {13.0, 14.0, 15.0, 16.0} };
    float recvbuf[SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    if (numtasks == SIZE) {
        source = 1;
        sendcount = SIZE;
        recvcount = SIZE;
        MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
                    MPI_FLOAT, source, MPI_COMM_WORLD);
        printf("rank= %d Results: %f %f %f %f\n", rank,
               recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
    } else
        printf("Must specify %d processors. Terminating.\n", SIZE);

    MPI_Finalize();
}

78 mpirun -np 4 scatter
rank= 0 Results: 1.000000 2.000000 3.000000 4.000000
rank= 1 Results: 5.000000 6.000000 7.000000 8.000000
rank= 2 Results: 9.000000 10.000000 11.000000 12.000000
rank= 3 Results: 13.000000 14.000000 15.000000 16.000000

79 Collective communication MPI_Gatherv MPI_Allgatherv §Allows different number of data elements to be sent by each process MPI_Scatterv §Allows different amounts of data to be sent to different processes MPI_Alltoallv §Allows different amounts of data to be sent to and received from each process …
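
As a closing sketch (not from the slides), MPI_Scatterv distributes a different number of elements to each process via per-rank count and displacement arrays; here rank i receives i+1 elements, and the bound MAXP is assumed only to keep the buffers fixed-size.

#include <stdio.h>
#include "mpi.h"

#define MAXP 64   /* assumed upper bound on the number of processes */

int main(int argc, char *argv[]) {
    int rank, size, i;
    int sendcounts[MAXP], displs[MAXP];
    int sendbuf[MAXP * MAXP];
    int recvbuf[MAXP];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size > MAXP) MPI_Abort(MPI_COMM_WORLD, 1);
    /* Rank i receives i+1 elements: counts and displacements differ per process */
    for (i = 0; i < size; i++) {
        sendcounts[i] = i + 1;
        displs[i] = (i == 0) ? 0 : displs[i-1] + sendcounts[i-1];
    }
    if (rank == 0)
        for (i = 0; i < displs[size-1] + sendcounts[size-1]; i++)
            sendbuf[i] = i;   /* data to distribute, significant only at the root */
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, rank + 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d received %d element(s), first is %d\n",
           rank, rank + 1, recvbuf[0]);
    MPI_Finalize();
    return 0;
}

MPI_Gatherv, MPI_Allgatherv and MPI_Alltoallv take analogous count and displacement arrays on the gathering or exchanging side.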