Introduction to MPI Programming


Introduction to MPI Programming (Part II)
Michael Griffiths, Deniz Savas & Alan Real
January 2006

Overview
- Review of point-to-point communications
- Data types
- Data packing
- Collective communication:
  - Broadcast, scatter and gather of data
  - Reduction operations
  - Barrier synchronisation

Blocking operations
- "Blocking" refers to when the operation has completed.
- A blocking call only returns from the subroutine once the operation has completed.

Non-blocking operations
- Return straight away and allow the sub-program to continue with other work.
- At some later time the sub-program should test or wait for the completion of the non-blocking operation.
- A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation.
- Non-blocking operations are not the same as sequential subroutine calls, since the operation continues after the call has returned.

Blocking send and receive

Sender mode            MPI call (F/C)   Completion status
Standard send          MPI_Send         Can be synchronous or buffered (often implementation dependent).
Synchronous send       MPI_Ssend        Only completes when the receive has completed.
Buffered send          MPI_Bsend        Always completes (unless an error occurs), irrespective of the receiver.
Ready send             MPI_Rsend        Always completes (unless an error occurs), irrespective of whether the receive has completed.
Combined send/receive  MPI_Sendrecv     Completes when the message has arrived and been received by the paired process.

(MPI_Isend, which begins a non-blocking send, is covered on the following slides.)

Non-blocking communication
Separate the communication into three phases:
1. Initiate the non-blocking communication.
2. Do some work (perhaps involving other communications).
3. Wait for the non-blocking communication to complete.
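
A minimal C sketch of these three phases (not from the course notes; it assumes the program is run with exactly two processes, and the payload values are purely illustrative):

  /* Sketch: the three phases of non-blocking communication.
     Assumes exactly two processes, e.g. mpirun -np 2 ./a.out */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, outgoing, incoming = -1;
      MPI_Request request;
      MPI_Status  status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      outgoing = 100 + rank;
      int partner = 1 - rank;                 /* rank 0 <-> rank 1 */

      /* Phase 1: initiate the non-blocking communication */
      if (rank == 0)
          MPI_Isend(&outgoing, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &request);
      else
          MPI_Irecv(&incoming, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &request);

      /* Phase 2: do some work that does not touch the buffers involved */
      /* ... other computation or communication could go here ...       */

      /* Phase 3: wait for the non-blocking operation to complete */
      MPI_Wait(&request, &status);

      if (rank == 1)
          printf("rank 1 received %d\n", incoming);

      MPI_Finalize();
      return 0;
  }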

Non-blocking send

[Diagram: processes in MPI_COMM_WORLD; the sender posts a send request, carries on working, then waits on the request.]

- The send is initiated and returns straight away.
- The sending process can do other things.
- It can test later whether the operation has completed.

Non-blocking receive

[Diagram: processes in MPI_COMM_WORLD; the receiver posts a receive request, carries on working, then waits on the request.]

- The receive is initiated and returns straight away.
- The receiving process can do other things.
- It can test later whether the operation has completed.

The request handle
- Non-blocking calls take the same arguments as the corresponding blocking calls, plus an additional request handle.
- In C/C++ the handle is of type MPI_Request / MPI::Request; in Fortran it is an INTEGER.
- The request handle is allocated when a communication is initiated.
- It can be queried to test whether the non-blocking operation has completed.

Non-blocking synchronous send

Fortran:
  CALL MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, error)
  CALL MPI_WAIT(request, status, error)
C:
  MPI_Issend(&buf, count, datatype, dest, tag, comm, &request);
  MPI_Wait(&request, &status);
C++:
  request = comm.Issend(&buf, count, datatype, dest, tag);
  request.Wait();

Non-blocking receive

Fortran:
  CALL MPI_IRECV(buf, count, datatype, src, tag, comm, request, error)
  CALL MPI_WAIT(request, status, error)
C:
  MPI_Irecv(&buf, count, datatype, src, tag, comm, &request);
  MPI_Wait(&request, &status);
C++:
  request = comm.Irecv(&buf, count, datatype, src, tag);
  request.Wait(status);

Blocking v non-blocking

- Send and receive can be blocking or non-blocking.
- A blocking send can be used with a non-blocking receive, and vice versa.
- Non-blocking sends can use any mode:

Operation         Fortran/C MPI call   C++
Standard send     MPI_Send(...)        Comm.Send(...)
Synchronous send  MPI_Ssend(...)       Comm.Ssend(...)
Buffered send     MPI_Bsend(...)       Comm.Bsend(...)
Ready send        MPI_Rsend(...)       Comm.Rsend(...)
Receive           MPI_Recv(...)        Comm.Recv(...)

- Synchronous mode affects completion, not initiation.
- A non-blocking call followed by an explicit wait is identical to the corresponding blocking communication.

Completion

Can either wait or test for completion:

Fortran (LOGICAL flag):
  CALL MPI_WAIT(request, status, ierror)
  CALL MPI_TEST(request, flag, status, ierror)
C (int flag):
  MPI_Wait(&request, &status);
  MPI_Test(&request, &flag, &status);
C++ (bool flag):
  request.Wait();           flag = request.Test();           (for sends)
  request.Wait(status);     flag = request.Test(status);     (for receives)
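
As a sketch (not from the course notes) of the test-for-completion pattern, the following C program has rank 1 poll MPI_Test while doing stand-in work; the loop counter simply represents useful computation:

  /* Sketch: polling with MPI_Test while doing other work (run with 2 processes). */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value = 42, flag = 0;
      long busy_iterations = 0;
      MPI_Request request;
      MPI_Status  status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
          while (!flag) {
              busy_iterations++;                  /* stands in for useful work */
              MPI_Test(&request, &flag, &status); /* has the message arrived?  */
          }
          printf("received %d after %ld test iterations\n", value, busy_iterations);
      }

      MPI_Finalize();
      return 0;
  }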

Other related wait and test routines

If multiple non-blocking calls are issued:
- MPI_TESTANY: tests whether any one of a list of requests (they could be send or receive requests) has completed.
- MPI_WAITANY: waits until any one of the list of requests has completed.
- MPI_TESTALL: tests whether all the requests in a list have completed.
- MPI_WAITALL: waits until all the requests in a list have completed.
- MPI_PROBE, MPI_IPROBE: allow incoming messages to be checked for without actually receiving them. Note that MPI_PROBE is blocking: it waits until there is something to probe for.
- MPI_CANCEL: cancels a pending communication. A last-resort, clean-up operation!
The 'any' and 'all' routines take an array of requests and can return an array of statuses; the 'any' routines return the index of the completed operation.
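
A hedged C sketch of the MPI_Waitall pattern: the root posts one non-blocking receive per sender, then waits on the whole array of requests. The payload (rank squared) is made up for illustration:

  /* Sketch: collecting one integer from every non-root rank with MPI_Waitall. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {
          int *data = malloc((size - 1) * sizeof(int));
          MPI_Request *requests = malloc((size - 1) * sizeof(MPI_Request));
          MPI_Status  *statuses = malloc((size - 1) * sizeof(MPI_Status));

          for (int i = 1; i < size; i++)          /* one request per sender */
              MPI_Irecv(&data[i - 1], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &requests[i - 1]);

          MPI_Waitall(size - 1, requests, statuses);  /* block until every receive is done */

          for (int i = 1; i < size; i++)
              printf("got %d from rank %d\n", data[i - 1], i);
          free(data); free(requests); free(statuses);
      } else {
          int payload = rank * rank;
          MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }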

Merging send and receive operations into a single unit

The syntax of the MPI_Sendrecv command is:

In C:
  int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag,
                   void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag,
                   MPI_Comm comm, MPI_Status *status);

In Fortran:
  <sendtype> sendbuf(:)
  <recvtype> recvbuf(:)
  INTEGER sendcount, sendtype, dest, sendtag, recvcount, recvtype
  INTEGER source, recvtag, comm, status(MPI_STATUS_SIZE), ierror
  CALL MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror)

Important notes about MPI_Sendrecv

- Beware! A message sent by MPI_Sendrecv is receivable by a regular receive operation if the destination and tag match.
- MPI_PROC_NULL can be specified for the destination or the source to allow one-directional working (useful for the end nodes in non-circular communication). Any communication with MPI_PROC_NULL returns immediately with no effect, but as if the operation had been successful. This can make programming easier.
- The send and receive buffers must not overlap; they must be separate memory locations. This restriction can be avoided by using the MPI_Sendrecv_replace routine.
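
As an illustration of the MPI_PROC_NULL point above, here is a minimal C sketch (not from the course notes) of a one-directional chain shift, where the end ranks simply talk to MPI_PROC_NULL and need no special-casing:

  /* Sketch: shifting data one place to the right along a chain of processes.
     The first and last ranks use MPI_PROC_NULL as their missing neighbour.   */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, sendval, recvval = -1;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      sendval = rank;
      int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;  /* no neighbour past the end  */
      int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;  /* no neighbour before rank 0 */

      /* Send to the right neighbour and receive from the left one in a single call. */
      MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                   &recvval, 1, MPI_INT, left,  0,
                   MPI_COMM_WORLD, &status);

      printf("rank %d received %d\n", rank, recvval);  /* rank 0 keeps -1: its receive was from MPI_PROC_NULL */

      MPI_Finalize();
      return 0;
  }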

Data packing

Up until now we have only seen contiguous data of pre-defined datatypes being communicated by MPI calls. This can be rather restrictive if what we intend to transfer involves structures made up of mixtures of primitive data types, such as an integer count followed by a sequence of real numbers.

One solution to this problem is to use the MPI_PACK and MPI_UNPACK routines. The philosophy is similar to Fortran writes/reads to/from internal buffers, and to the sscanf function in C. MPI_PACK can be called consecutively to compress data into a send buffer; the resulting buffer can then be sent using MPI_SEND (or equivalent) with the datatype set to MPI_PACKED. At the receiving end it is received with MPI_RECV, again with datatype MPI_PACKED, and the received data can then be unpacked with MPI_UNPACK to recover the originally packed data.

This way of working can also improve communication efficiency by reducing the number of 'send-receive' data transfer calls: there are usually fixed overheads associated with setting up a communication, which cause inefficiency if the messages sent and received are too small.

MPI_Pack

Fortran:
  <type> INBUF(:), OUTBUF(:)
  INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
  CALL MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)
C:
  int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm);

Packs the message in inbuf, of type datatype and length incount, and stores it in outbuf. outsize is the maximum length of outbuf in bytes, rather than its actual size. On entry, position indicates the starting location in outbuf at which data will be written; on exit, position points to the first free position in outbuf following the location occupied by the packed message. It can then be used directly as the position argument of the next MPI_Pack call.

MPI_Unpack

Fortran:
  <type> INBUF(:), OUTBUF(:)
  INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
  CALL MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)
C:
  int MPI_Unpack(void *inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm);

Unpacks the message in inbuf as data of type datatype and length outcount, and stores it in outbuf. On entry, position indicates the starting location in inbuf from which data will be read; on exit, position points to the first position of the next set of data in inbuf. It can then be used directly as the position argument of the next MPI_Unpack call.
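
A hedged C sketch of the pack/send/receive/unpack sequence described above, packing an integer count followed by that many doubles (the buffer size and values are illustrative, and two processes are assumed):

  /* Sketch: packing an integer count followed by a block of doubles (2 processes). */
  #include <mpi.h>
  #include <stdio.h>

  #define BUFSIZE 1024

  int main(int argc, char **argv)
  {
      int rank, position = 0;
      char buffer[BUFSIZE];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          int n = 3;
          double values[3] = { 1.5, 2.5, 3.5 };
          /* Consecutive packs append to the buffer; position is updated each time. */
          MPI_Pack(&n,     1, MPI_INT,    buffer, BUFSIZE, &position, MPI_COMM_WORLD);
          MPI_Pack(values, n, MPI_DOUBLE, buffer, BUFSIZE, &position, MPI_COMM_WORLD);
          /* Send only the bytes actually used, with datatype MPI_PACKED. */
          MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          int n;
          double values[3];
          MPI_Recv(buffer, BUFSIZE, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
          /* Unpack in the same order the data was packed. */
          MPI_Unpack(buffer, BUFSIZE, &position, &n,     1, MPI_INT,    MPI_COMM_WORLD);
          MPI_Unpack(buffer, BUFSIZE, &position, values, n, MPI_DOUBLE, MPI_COMM_WORLD);
          printf("received %d doubles, first = %g\n", n, values[0]);
      }

      MPI_Finalize();
      return 0;
  }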

Collective Communication

Overview
- Introduction and characteristics
- Barrier synchronisation
- Global reduction operations
- Predefined operations
- Broadcast
- Scatter
- Gather
- Partial sums
- Exercise

Collective communications
- Are higher-level routines involving several processes at a time.
- Can be built out of point-to-point communications.
- Examples are barriers, broadcast and reduction operations.

Collective communication

- Communications involving a group of processes, called by all processes in a communicator.
- Examples:
  - Broadcast, scatter, gather (data distribution)
  - Global sum, global maximum, etc. (reduction operations)
  - Barrier synchronisation

Characteristics:
- Collective communication will not interfere with point-to-point communication, and vice versa.
- All processes must call the collective routine.
- Synchronisation is not guaranteed (except for barrier).
- No non-blocking collective communication.
- No tags.
- Receive buffers must be exactly the right size.

Collective communications (one for all, all for one!)

Collective communication is defined as communication that involves all the processes in a group. Collective communication routines can be divided into the following broad categories:
- Barrier synchronisation
- Broadcast from one to all
- Scatter from one to all
- Gather from all to one
- Scatter/gather from all to all
- Global reduction (distributed elementary operations)

IMPORTANT NOTE: Collective communication operations and the point-to-point operations seen earlier are invisible to each other and hence do not interfere with each other. This is important for avoiding deadlocks due to interference.

Barrier synchronization

[Diagram: time runs along the vertical axis; processes reach a BARRIER statement at different times.]

Here there are seven processes running, and three of them are waiting idle at the barrier statement for the other four to catch up.

Graphic representations of collective communication types

[Diagram: data layouts before and after BROADCAST, SCATTER, GATHER, ALLGATHER and ALLTOALL, showing how the data items are distributed across the processes in each case.]

Barrier synchronisation

Each process in the communicator waits at the barrier until all processes have encountered the barrier.

Fortran:
  INTEGER comm, error
  CALL MPI_BARRIER(comm, error)
C:
  MPI_Barrier(MPI_Comm comm);
C++:
  Comm.Barrier();    e.g. MPI::COMM_WORLD.Barrier();

Global reduction operations
Used to compute a result involving data distributed over a group of processes:
- Global sum or product
- Global maximum or minimum
- Global user-defined operation

Predefined operations

MPI name (F/C)   MPI name (C++)   Function
MPI_MAX          MPI::MAX         Maximum
MPI_MIN          MPI::MIN         Minimum
MPI_SUM          MPI::SUM         Sum
MPI_PROD         MPI::PROD        Product
MPI_LAND         MPI::LAND        Logical AND
MPI_BAND         MPI::BAND        Bitwise AND
MPI_LOR          MPI::LOR         Logical OR
MPI_BOR          MPI::BOR         Bitwise OR
MPI_LXOR         MPI::LXOR        Logical exclusive OR
MPI_BXOR         MPI::BXOR        Bitwise exclusive OR
MPI_MAXLOC       MPI::MAXLOC      Maximum and location
MPI_MINLOC       MPI::MINLOC      Minimum and location

MPI_Reduce

Performs count operations (o) on the individual elements of sendbuf across processes.

[Diagram: ranks 0-2 hold (A B C), (D E F) and (G H I); after MPI_REDUCE the root holds the element-wise results AoDoG, BoEoH, ...]

MPI_Reduce syntax

Fortran:
  INTEGER count, rtype, op, root, comm, error
  CALL MPI_REDUCE(sbuf, rbuf, count, rtype, op, root, comm, error)
C:
  MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
C++:
  Comm.Reduce(const void* sbuf, void* recvbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op, int root);

MPI_Reduce example

Integer global sum:

Fortran:
  INTEGER x, result, error
  CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, error)
C:
  int x, result;
  MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
C++:
  MPI::COMM_WORLD.Reduce(&x, &result, 1, MPI::INT, MPI::SUM, 0);
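
The location operations from the predefined-operations table combine a value with an index. As a sketch (not part of the original slides), the following C program uses MPI_MAXLOC with the paired datatype MPI_DOUBLE_INT to find the largest value and the rank that owns it; the local values are made up for illustration:

  /* Sketch: global maximum and its owner using MPI_MAXLOC and MPI_DOUBLE_INT. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      struct { double value; int rank; } local, global;   /* layout expected by MPI_DOUBLE_INT */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      local.value = 1.0 / (rank + 1);   /* illustrative local result */
      local.rank  = rank;

      MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("maximum %g found on rank %d\n", global.value, global.rank);

      MPI_Finalize();
      return 0;
  }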

MPI_Allreduce

No root process: all processes get the results of the reduction operation.

[Diagram: ranks 0-2 hold (A B C), (D E F) and (G H I); after MPI_ALLREDUCE every rank holds the element-wise results, e.g. AoDoG.]

MPI_Allreduce syntax

Fortran:
  INTEGER count, rtype, op, comm, error
  CALL MPI_ALLREDUCE(sbuf, rbuf, count, rtype, op, comm, error)
C:
  MPI_Allreduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
C++:
  Comm.Allreduce(const void* sbuf, void* recvbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op);
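
A typical use of MPI_Allreduce is a convergence check where every rank needs the same global result. A hedged C sketch (the residual values and tolerance are purely illustrative):

  /* Sketch: every rank obtains the global maximum residual with MPI_Allreduce. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      double local_residual, global_residual;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      local_residual = 1.0e-3 * (rank + 1);   /* illustrative per-rank value */

      /* Every process receives the maximum over all ranks, so all of them
         can take the same decision about whether to stop iterating.        */
      MPI_Allreduce(&local_residual, &global_residual, 1, MPI_DOUBLE,
                    MPI_MAX, MPI_COMM_WORLD);

      if (rank == 0)
          printf("global maximum residual = %g\n", global_residual);

      MPI_Finalize();
      return 0;
  }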

Practice Session 3: using reduction operations

This example uses the continued-fraction method of calculating pi and makes each processor calculate a different portion of the expansion series.
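
To illustrate the structure of the exercise (each rank sums its own portion of a series, then the partial sums are combined with a reduction), here is a C sketch. It deliberately uses the simpler Leibniz series as a stand-in; the actual practice exercise uses a continued-fraction expansion instead:

  /* Sketch: rank r sums terms r, r+size, r+2*size, ... of a series for pi,
     then the partial sums are combined with MPI_Reduce.  The Leibniz series
     is used here purely for illustration.                                   */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      const long nterms = 1000000;
      double partial = 0.0, pi = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Sum every size-th term of sum_k (-1)^k / (2k+1). */
      for (long k = rank; k < nterms; k += size)
          partial += (k % 2 == 0 ? 1.0 : -1.0) / (2.0 * k + 1.0);

      MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("pi is approximately %.10f\n", 4.0 * pi);

      MPI_Finalize();
      return 0;
  }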

Broadcast

Duplicates data from the root process to the other processes in the communicator.

[Diagram: before the broadcast only the root holds A; afterwards every rank (0-3) holds A.]

Broadcast syntax

Fortran:
  INTEGER count, datatype, root, comm, error
  CALL MPI_BCAST(buffer, count, datatype, root, comm, error)
C:
  MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
C++:
  Comm.Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root);

E.g. broadcasting 10 integers from rank 0:
  int tenints[10];
  MPI::COMM_WORLD.Bcast(&tenints, 10, MPI::INT, 0);

Scatter

Distributes data from the root process amongst the processes within the communicator.

[Diagram: the root holds A B C D; after the scatter ranks 0-3 hold A, B, C and D respectively.]

Scatter syntax

scount (and rcount) is the number of elements each process is sent (i.e. it equals the number received).

Fortran:
  INTEGER scount, stype, rcount, rtype, root, comm, error
  CALL MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
C:
  MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);
C++:
  Comm.Scatter(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
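
A hedged C sketch of a scatter: the root distributes CHUNK integers to each process (the chunk size and data are illustrative; only the root needs to allocate the full send buffer):

  /* Sketch: the root scatters CHUNK integers to every process (including itself). */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define CHUNK 4   /* elements per process; illustrative */

  int main(int argc, char **argv)
  {
      int rank, size;
      int recvbuf[CHUNK];
      int *sendbuf = NULL;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {                       /* only the root needs the full array */
          sendbuf = malloc(size * CHUNK * sizeof(int));
          for (int i = 0; i < size * CHUNK; i++)
              sendbuf[i] = i;
      }

      /* scount = rcount = CHUNK: each process is sent (and receives) CHUNK elements. */
      MPI_Scatter(sendbuf, CHUNK, MPI_INT, recvbuf, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

      printf("rank %d got elements starting at %d\n", rank, recvbuf[0]);

      if (rank == 0) free(sendbuf);
      MPI_Finalize();
      return 0;
  }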

Gather

Collects data distributed amongst the processes in the communicator onto the root process (collection is done in rank order).

[Diagram: ranks 0-3 hold A, B, C and D; after the gather the root holds A B C D.]

Gather syntax

Takes the same arguments as the Scatter operation.

Fortran:
  INTEGER scount, stype, rcount, rtype, root, comm, error
  CALL MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
C:
  MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);
C++:
  Comm.Gather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
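
A hedged C sketch of a gather: every rank contributes one integer (an illustrative value) and the root collects them in rank order; the receive buffer is only significant at the root:

  /* Sketch: every rank contributes one integer; the root gathers them in rank order. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size, myval;
      int *allvals = NULL;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      myval = rank * 10;                        /* illustrative local value */

      if (rank == 0)
          allvals = malloc(size * sizeof(int)); /* receive buffer only needed at the root */

      MPI_Gather(&myval, 1, MPI_INT, allvals, 1, MPI_INT, 0, MPI_COMM_WORLD);

      if (rank == 0) {
          for (int i = 0; i < size; i++)
              printf("allvals[%d] = %d\n", i, allvals[i]);
          free(allvals);
      }

      MPI_Finalize();
      return 0;
  }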

All gather

Collects all the data onto all processes in the communicator.

[Diagram: ranks 0-3 hold A, B, C and D; after the allgather every rank holds A B C D.]

All gather syntax

As Gather, but with no root defined.

Fortran:
  INTEGER scount, stype, rcount, rtype, comm, error
  CALL MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, error)
C:
  MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm);
C++:
  Comm.Allgather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype);