1
CSE 160 – Lecture 16: MPI Concepts, Topology, and Synchronization
2
So Far …
We have concentrated on generic message-passing algorithms
We have used PVM to implement codes (simpler on workstations)
MPI is a de facto standard for message passing and introduces some important concepts
3
Message Addressing
Identify an endpoint, and use a tag to distinguish a particular message
–pvm_send(dest, tag)
–MPI_SEND(buf, len, type, dest, tag, COMM)
Receiving
–recv(src, tag); recv(*, tag); recv(src, *); recv(*, *)
What if you want to build a library that uses message passing? Is (src, tag) safe in all instances?
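As a concrete sketch (not from the slides; it assumes at least two processes, and the tag value 99 and buffer contents are arbitrary), addressing a message by (dest, tag) and receiving it by (src, tag) in MPI looks roughly like this:

/* Hedged sketch: rank 0 sends to rank 1; rank 1 receives by (src, tag). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, data[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* endpoint = rank 1, message distinguished by tag 99 */
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* matches only messages from source 0 carrying tag 99 */
        MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}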
4
Level 0 Issues
Basic point-to-point message passing is straightforward, but how does one …
–Make it go fast
  Eliminate extra memory copies
  Take advantage of specialized hardware
–Move complex data structures (packing)
–Receive from one-of-many (wildcarding)
–Synchronize a group of tasks
–Recover from errors
–Start tasks
–Build safe libraries
–Monitor tasks
–…
5
MPI-1 addresses many of the level 0 issues (but not all)
6
A long history of research efforts in message passing
P4, Chameleon, Parmacs, TCGMSG, CHIMP, NX (Intel i860, Paragon), PVM, …
And these begot MPI
7
So What is MPI?
It is a standard message-passing API
–Specifies many variants of send/recv (9 send interface calls)
  E.g., synchronous send, asynchronous send, ready send, asynchronous ready send
–Plus other defined APIs
  Process topologies
  Group operations
  Derived datatypes
  Profiling API (standard way to instrument MPI code)
–Implemented and optimized by machine vendors
–Should you use it? Absolutely!
8
So What’s Missing in MPI-1?
Process control
–How do you start 100 tasks?
–How do you kill/signal/monitor remote tasks?
I/O
–Addressed in MPI-2
Fault tolerance
–If one MPI process dies, the rest eventually hang
Interoperability
–No standard for running a single parallel job across architectures (e.g., you cannot split a computation between x86 Linux and Alpha)
9
Communication Envelope
(src, tag) is enough to distinguish messages; however, there is a problem:
–How does one write safe libraries that also use message passing?
  Tags used in libraries can be the same as tags used in the calling code, so messages could get confused
Message envelope – add another, system-assigned tag to a message
–This label defines a message envelope
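To make the library-safety point concrete, a common idiom (a sketch, not from the slides; lib_init, lib_finalize, and lib_comm are made-up names) is for a library to duplicate the caller's communicator once, so its internal messages travel in a private context and can never match the caller's tags:

/* Hedged sketch: a hypothetical library creates its own context
   by duplicating the communicator handed to it by the caller. */
#include <mpi.h>

static MPI_Comm lib_comm;   /* private communicator for library traffic */

void lib_init(MPI_Comm user_comm) {
    /* Same group of processes, new context: messages sent on lib_comm
       can only be matched by receives on lib_comm, whatever their tags. */
    MPI_Comm_dup(user_comm, &lib_comm);
}

void lib_finalize(void) {
    MPI_Comm_free(&lib_comm);
}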
10
Envelopes
[Figure: two separate envelopes, each holding messages (Tag 100, Src A), (Tag 100, Src B), and (Tag 200, Src A) – identical (tag, src) pairs coexist in different envelopes without interfering]
Context tag – messages are received relative to a context
11
Communicators
Messages ride inside of message envelopes, and receives can only match candidate messages relative to an envelope. MPI implements these envelopes as communicators.
A communicator is more than just an envelope – it also encompasses a group of processes.
Communicator = context + group of tasks
12
MPI_COMM_WORLD
MPI-1 programs are static – they can never grow or shrink
There is one default communicator called MPI_COMM_WORLD
All processes are members of MPI_COMM_WORLD
Each process is labeled 0 – N, for a communicator of size N+1
–Similar to a PVM group
13
How do you build new groups?
All new communicators are derived from existing communicators
–All communicators have MPI_COMM_WORLD as an “ancestor”
–New communicators are formed synchronously
  All members of a new communicator call the construction routine at the same time
MPI_COMM_DUP(), MPI_COMM_SPLIT()
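For example, MPI_Comm_split can carve MPI_COMM_WORLD into sub-communicators by a color argument. The sketch below (not from the slides) puts even and odd world ranks into separate communicators:

/* Hedged sketch: split MPI_COMM_WORLD into "even" and "odd" communicators. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int world_rank, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* All processes call the constructor together (synchronously).
       color chooses the new group; key orders the ranks within it. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);

    MPI_Comm_rank(sub_comm, &sub_rank);
    printf("world rank %d has rank %d in its sub-communicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}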
14
Communicators as groups
[Figure: six processes with world ranks 0–5, partitioned into two sub-communicators in which they are renumbered with local ranks 0–2]
15
Running MPI Programs
The MPI-1 standard does not specify how to run an MPI program, just as the Fortran standard does not specify how to run a Fortran program.
In general, starting an MPI program is dependent on the implementation of MPI you are using, and might require various scripts, program arguments, and/or environment variables.
mpiexec is part of MPI-2, as a recommendation but not a requirement
–You can use mpiexec for MPICH and mpirun for other systems
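As an illustration only (the launcher name and its flags depend on the MPI installation; cpi stands for the compiled PI example shown later), starting four processes typically looks like one of these:

mpiexec -n 4 ./cpi     # MPI-2 recommended launcher (e.g., MPICH)
mpirun -np 4 ./cpi     # common alternative on other systems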
16
Finding Out About the Environment
Two important questions that arise early in a parallel program are:
–How many processes are participating in this computation?
–Which one am I?
MPI provides functions to answer these questions:
–MPI_Comm_size reports the number of processes
–MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
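A minimal sketch of both calls (not from the slides):

/* Hedged sketch: the canonical "who am I, and how many of us are there" query. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? */

    printf("I am process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}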
17
MPI Datatypes
The data in a message to be sent or received is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as:
–predefined, corresponding to a datatype from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
–a contiguous array of MPI datatypes
–a strided block of datatypes
–an indexed array of blocks of datatypes
–an arbitrary structure of datatypes
There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise.
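As a sketch of the strided case (not from the slides; it assumes an N×N matrix of doubles stored column-wise, so one row is N elements each N slots apart, and send_row is a made-up helper), MPI_Type_vector can describe such a row in place:

/* Hedged sketch: send one row of an N x N matrix stored column-wise,
   i.e., element (i,j) lives at a[i + j*N]. */
#include <mpi.h>

#define N 8

void send_row(double *a, int r, int dest, MPI_Comm comm) {
    MPI_Datatype row_type;

    /* N blocks of 1 double, consecutive blocks N doubles apart */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &row_type);
    MPI_Type_commit(&row_type);

    /* row r starts at a[r]; no packing into a contiguous buffer needed */
    MPI_Send(&a[r], 1, row_type, dest, 0, comm);

    MPI_Type_free(&row_type);
}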
18
MPI Tags
Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.
Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes.
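A sketch of an unscreened receive (not from the slides; handle_next_message is a made-up helper) shows how the actual source and tag are recovered from the status object:

/* Hedged sketch: receive without screening, then inspect the envelope. */
#include <mpi.h>
#include <stdio.h>

void handle_next_message(MPI_Comm comm) {
    int payload;
    MPI_Status status;

    /* accept any sender and any tag */
    MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             comm, &status);

    /* the matched envelope is recorded in the status object */
    printf("received %d from rank %d with tag %d\n",
           payload, status.MPI_SOURCE, status.MPI_TAG);
}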
19
Why Datatypes?
Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).
Specifying an application-oriented layout of data in memory
–reduces memory-to-memory copies in the implementation
–allows the use of special hardware (scatter/gather) when available
Theoretically simplifies the programmer’s life
How would you build “Datatypes” in PVM? – sequences of pvm_pack() calls
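For comparison, a rough PVM equivalent (a sketch, not from the slides; send_pair is a made-up helper and the message tag 42 is arbitrary) packs each field by hand before every send:

/* Hedged sketch: hand-packing an (int, double) pair in PVM. */
#include <pvm3.h>

void send_pair(int dest_tid, int n, double x) {
    /* start a fresh send buffer, XDR-encoded for heterogeneity */
    pvm_initsend(PvmDataDefault);

    /* the "datatype" is just the order of pack calls,
       repeated by hand at every send */
    pvm_pkint(&n, 1, 1);
    pvm_pkdouble(&x, 1, 1);

    pvm_send(dest_tid, 42);   /* 42 = arbitrary message tag */
}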
20
Introduction to Collective Operations in MPI
Collective operations are called by all processes in a communicator.
MPI_BCAST distributes data from one process (the root) to all others in a communicator.
–How does this differ from pvm_mcast(), pvm_bcast()?
MPI_REDUCE combines data from all processes in a communicator and returns it to one process.
In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.
21
Example: PI in C – 1

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
22
Example: PI in C – 2

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
23
Sources of Deadlocks
Send a large message from process 0 to process 1
–If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
What happens with:
  Process 0: Send(1)  Recv(1)
  Process 1: Send(0)  Recv(0)
This is called “unsafe” because it depends on the availability of system buffers (PVM does buffering)
24
Some Solutions to the “unsafe” Problem
Order the operations more carefully:
  Process 0: Send(1)  Recv(1)
  Process 1: Recv(0)  Send(0)
Use non-blocking operations:
  Process 0: Isend(1)  Irecv(1)  Waitall
  Process 1: Isend(0)  Irecv(0)  Waitall
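A minimal sketch of the non-blocking version (not from the slides; exchange is a made-up helper and it assumes exactly two processes): each rank posts both operations and only then waits, so neither side can block the other.

/* Hedged sketch: deadlock-free exchange between ranks 0 and 1. */
#include <mpi.h>

void exchange(int rank, double *sendbuf, double *recvbuf, int count) {
    int other = 1 - rank;              /* the peer rank (0 <-> 1) */
    MPI_Request reqs[2];

    /* post both operations; neither call blocks waiting for the peer */
    MPI_Isend(sendbuf, count, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* complete both, regardless of the order the peer posted its calls */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}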
25
Extending the Message-Passing Interface
Dynamic process management
–Dynamic process startup
–Dynamic establishment of connections
One-sided communication
–Put/get
–Other operations
Parallel I/O
Other MPI-2 features
–Generalized requests
–Bindings for C++ / Fortran-90; interlanguage issues
26
When to use MPI
Portability and performance
Irregular data structures
Building tools for others
–Libraries
Need to manage memory on a per-processor basis
27
When not to use MPI
Regular computation matches HPF
–But see the PETSc/HPF comparison (ICASE 97-72)
Solution (e.g., a library) already exists
–http://www.mcs.anl.gov/mpi/libraries.html
Require fault tolerance
–Sockets
Distributed computing
–CORBA, DCOM, etc.
28
Synchronization
Messages serve two purposes:
–Moving data
–Synchronizing processes
When does synchronization occur for
–A blocking send?
–A non-blocking send?
  »Depends on the implementation
29
Common Synchronization Constructs
Barrier
–All processes stop; no information is passed
Broadcast
–All processes get information from the root
  How does this differ from a barrier?
  Ideas on how to implement an MPI_BCAST? (one sketch follows below)
Reduction
–Data is reduced (summed, multiplied, counted, …)
  Ideas on how to implement MPI_REDUCE?
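In answer to the MPI_BCAST question, here is one possible sketch (not from the slides, and not how any particular MPI library actually implements it; tree_bcast_from_zero is a made-up name): point-to-point messages arranged as a doubling tree rooted at rank 0, so the number of rounds grows with log2(P) rather than P.

/* Hedged sketch: a log2(P)-round broadcast from rank 0, built only
   from point-to-point messages. */
#include <mpi.h>

void tree_bcast_from_zero(int *buf, int count, MPI_Comm comm) {
    int rank, size, mask;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* In round k, the 2^k processes that already hold the data each
       forward it to one process that does not, doubling the reach. */
    for (mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            if (rank + mask < size)
                MPI_Send(buf, count, MPI_INT, rank + mask, 0, comm);
        } else if (rank < 2 * mask) {
            MPI_Recv(buf, count, MPI_INT, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}

A reduction can use the same tree run in the opposite direction, combining partial results as they move toward the root.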
30
Synchronization Trees
The topology of the messages that implement a reduction is important.
The LogP lecture illustrated a tree that was neither linear nor binary
–The major idea was to maintain concurrency
MPI defines many of its operations as (potentially) synchronizing, so that implementers have a choice in how to optimize them.