MPI Derived Data Types and Collective Communication


MPI Derived Data Types and Collective Communication CDP

Why Derived Data Types?
Elements in an MPI message must all be of the same type, so sending complex data would otherwise require two separate messages.
Bad example:
  typedef struct { float a, b; int n; } Mine;
  MPI_Send(&data, 1, Mine, 0, 99, MPI_COMM_WORLD);
Reminder: int MPI_Send(void *sndbuf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
This is a compilation error: a C type such as Mine cannot be passed where an MPI_Datatype parameter is expected.

Why Derived Data Types?
Elements in an MPI message must all be of the same type, so sending complex data would otherwise require two separate messages.
Bad example:
  typedef struct { float a, b; int n; } Mine;
  MPI_Send(&data, sizeof(Mine), MPI_BYTE, 0, 99, MPI_COMM_WORLD);
Reminder: int MPI_Send(void *sndbuf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
This will fail when communicating between different kinds of machines: they may interpret the raw bytes differently (32-bit vs. 64-bit layouts, different padding, endianness, etc.).

Defining Derived Data Types

typedef struct { float a, b; int n; } Mine;

// Build a derived datatype for two floats and an int
void build_derived_type(Mine* indata, MPI_Datatype* message_type_ptr)
{
    int block_lengths[3];
    MPI_Aint displacements[3];
    MPI_Datatype typelist[3];
    MPI_Aint addresses[4];   // Helper array

    // First specify the types
    typelist[0] = MPI_FLOAT;
    typelist[1] = MPI_FLOAT;
    typelist[2] = MPI_INT;

    // Specify the number of elements of each type
    block_lengths[0] = block_lengths[1] = block_lengths[2] = 1;

    // Calculate the displacements of the members relative to indata
    MPI_Address(indata, &addresses[0]);
    MPI_Address(&(indata->a), &addresses[1]);
    MPI_Address(&(indata->b), &addresses[2]);
    MPI_Address(&(indata->n), &addresses[3]);
    displacements[0] = addresses[1] - addresses[0];
    displacements[1] = addresses[2] - addresses[0];
    displacements[2] = addresses[3] - addresses[0];

    // Create the derived type
    MPI_Type_struct(3, block_lengths, displacements, typelist, message_type_ptr);

    // Commit it so that it can be used
    MPI_Type_commit(message_type_ptr);
}

Without the commit, an error will occur when using the new type.
All the processes must create the derived data type and commit it.
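The code above uses the MPI-1 routines MPI_Address and MPI_Type_struct, which were deprecated in MPI-2 and removed in MPI-3. For reference, a minimal sketch of the same construction with their current replacements, MPI_Get_address and MPI_Type_create_struct (the name build_derived_type2 is illustrative):

// Sketch: the same derived type built with the non-deprecated routines
void build_derived_type2(Mine* indata, MPI_Datatype* message_type_ptr)
{
    int          block_lengths[3] = { 1, 1, 1 };
    MPI_Datatype typelist[3]      = { MPI_FLOAT, MPI_FLOAT, MPI_INT };
    MPI_Aint     displacements[3];
    MPI_Aint     base, addr;

    // Displacements of each member relative to the start of the struct
    MPI_Get_address(indata, &base);
    MPI_Get_address(&(indata->a), &addr);  displacements[0] = addr - base;
    MPI_Get_address(&(indata->b), &addr);  displacements[1] = addr - base;
    MPI_Get_address(&(indata->n), &addr);  displacements[2] = addr - base;

    MPI_Type_create_struct(3, block_lengths, displacements, typelist, message_type_ptr);
    MPI_Type_commit(message_type_ptr);
}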

Defining Derived Data Types

Mine data1, data2, data3[10];
MPI_Datatype my_type;
MPI_Request request;        // needed for the non-blocking send below

build_derived_type(&data1, &my_type);

// Using the new data type
MPI_Send (&data1,  1, my_type, 0, 98, MPI_COMM_WORLD);
MPI_Ssend(&data2,  1, my_type, 0, 99, MPI_COMM_WORLD);
MPI_Isend(data3,  10, my_type, 0, 97, MPI_COMM_WORLD, &request);
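For completeness, a sketch of the matching receive side, assuming rank 0 is the receiver (the sends above all target rank 0) and that it has built and committed the same derived type:

// Receiver sketch: rank 0 receives the struct sent with tag 98
Mine incoming;
MPI_Datatype my_type;
MPI_Status status;

build_derived_type(&incoming, &my_type);
MPI_Recv(&incoming, 1, my_type, MPI_ANY_SOURCE, 98, MPI_COMM_WORLD, &status);

// Release the derived type once it is no longer needed
MPI_Type_free(&my_type);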

MPI Collective Communication
Reuses code for communication patterns that appear in many different types of applications.
Can be built on top of the point-to-point communication routines.
Collective communication routines in MPI:
- Barrier synchronization
- Broadcast from one member to all other members
- Scatter data from one member to all other members
- Gather data from all members to one member
- All-to-all exchange of data
- Global reduction (e.g., sum or min of a "common" data element, delivered to one node)
- Scan across all members of a group
In short: synchronization, data movement, and global computation.

Characteristics of collective communication routines
Coordinated communication within a group of processes, identified by an MPI communicator.
No message tags are needed.
A substitute for a more complex sequence of point-to-point calls.
All routines block until they are locally complete.
(Starting from Open MPI 1.7, non-blocking collective communication is supported; we will work with Open MPI 1.5.)
In some cases, a root process originates or receives all the data.
The amount of data sent must exactly match the amount of data specified by the receiver.

Barrier synchronization routine
A synchronization primitive: blocks until all the processes in the group have called it.
MPI_Barrier(MPI_Comm comm)
comm: a communicator
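A common use, shown as a sketch (rank is assumed to have been obtained with MPI_Comm_rank, and the computation is a placeholder): bracketing a compute phase with barriers so that timing is not skewed by processes that arrive early.

MPI_Barrier(MPI_COMM_WORLD);        // everyone starts the phase together
double t0 = MPI_Wtime();
/* ... local computation ... */
MPI_Barrier(MPI_COMM_WORLD);        // wait until everyone has finished
double t1 = MPI_Wtime();
if (rank == 0)
    printf("phase took %f seconds\n", t1 - t0);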

Data movement routines - Broadcast
One process sends data to all the processes in a group.
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Must be called by every process in the group, specifying the same comm, root and count.
MPI_Comm comm;
int array[100];
int root = 1;
MPI_Bcast(array, 100, MPI_INT, root, comm);
[Diagram: the root's buffer (A) is copied into the buffer of every process in the group.]
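A minimal complete sketch (the value 42 is illustrative): rank 0 picks a parameter and broadcasts it, after which every process holds the same value.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 42;                                   // e.g. read from input at the root

    // Every process calls MPI_Bcast with the same root, count and type;
    // after the call, all processes hold the root's value of n.
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}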

Gather and Scatter
int MPI_Scatter(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
int MPI_Gather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
scount and rcount specify the number of elements to send/receive to/from each process (not the total).
[Diagram: Scatter splits the root's buffer (A, B, C, D) into one block per process; Gather is the inverse, collecting one block from each process into the root's buffer.]

MPI_Scatter
Scatter sets of 100 ints from the root to each process in the group.
At all processes:
  MPI_Scatter(NULL, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm)
At root = 0:
  MPI_Scatter(sbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm)
What is the size of sbuf?
You can send fewer elements than you intend to receive, but the receiver cannot check how many items were actually received.
[Diagram: each process receives 100 ints into rbuf; at the root, sbuf holds 100 ints per process.]
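A sketch that makes the buffer sizes explicit: with P processes and 100 ints per process, the root's send buffer must hold 100*P ints, while every process (the root included) only needs room for its own 100 in rbuf.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int* sbuf = NULL;
    int  rbuf[100];

    if (rank == 0) {
        // The root owns the full array: 100 ints for each process.
        sbuf = malloc(100 * size * sizeof(int));
        for (int i = 0; i < 100 * size; i++)
            sbuf[i] = i;
    }

    // Block k of sbuf (100 ints) ends up in rbuf of rank k.
    MPI_Scatter(sbuf, 100, MPI_INT, rbuf, 100, MPI_INT, 0, MPI_COMM_WORLD);

    free(sbuf);                  // no-op on non-root ranks (sbuf is NULL)
    MPI_Finalize();
    return 0;
}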

Gatherv and Scatterv
Allow each process to send/receive a different number of elements.
int MPI_Gatherv(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, int root, MPI_Comm comm)
int MPI_Scatterv(void* sbuf, int* scount, int* displs, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)

MPI_Gatherv
At all processes (scount = 100 or 150, for example):
  MPI_Gatherv(sbuf, scount, MPI_INT, NULL, NULL, NULL, MPI_INT, root, comm);
At root = 0, with rcount[3] = {50, 100, 150}:
  MPI_Gatherv(sbuf, 50, MPI_INT, rbuf, rcount, displs, MPI_INT, root, comm);
Allows scattering/gathering only part of your data without having to pack it into a new array.
The differences between consecutive displacements must be at least as large as the corresponding rcount entries.
[Diagram: blocks of 50, 100 and 150 ints land in rbuf at offsets displs[0], displs[1] and displs[2].]
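A sketch of how the rcount and displs arrays might be built when every rank contributes a different amount (here rank k sends k+1 ints, an illustrative choice; the per-rank counts are first collected with a plain MPI_Gather):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int scount = rank + 1;                       // a different count on every rank
    int* sbuf = malloc(scount * sizeof(int));
    for (int i = 0; i < scount; i++)
        sbuf[i] = rank;

    int *rcount = NULL, *displs = NULL, *rbuf = NULL;
    if (rank == 0) {
        rcount = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
    }

    // The root learns how much each process will send.
    MPI_Gather(&scount, 1, MPI_INT, rcount, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int total = 0;
        for (int i = 0; i < size; i++) {         // displs[i] = sum of earlier counts
            displs[i] = total;
            total += rcount[i];
        }
        rbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sbuf, scount, MPI_INT, rbuf, rcount, displs, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}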

Allgather
int MPI_Allgather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
int MPI_Allgatherv(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, MPI_Comm comm)
[Diagram: each process contributes its sbuf (A, B, C, D); after the call, every process's rbuf holds A, B, C, D - a gather whose result is available on all processes.]
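A small sketch (assuming MPI is already initialized and rank and size have been queried): every process contributes one value and, with no designated root, every process ends up with the whole array.

int myval = rank * rank;                     // one value contributed per process
int* all  = malloc(size * sizeof(int));

// After the call, all[i] holds the value contributed by rank i,
// on every process (equivalent to MPI_Gather followed by MPI_Bcast).
MPI_Allgather(&myval, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);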

All to All
int MPI_Alltoall(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
scount must be equal to rcount; there are variants (MPI_Alltoallv, MPI_Alltoallw) that allow different data sizes and types.
[Diagram: rank i sends the jth block of its sbuf to rank j, so rank j's rbuf ends up holding block j from every rank - a transpose of the block matrix.]
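A sketch (again assuming rank and size are already known): each process prepares one int per destination, and after the call holds one int from every source.

int* sbuf = malloc(size * sizeof(int));
int* rbuf = malloc(size * sizeof(int));
for (int j = 0; j < size; j++)
    sbuf[j] = 100 * rank + j;                // block j is destined for rank j

// Afterwards, rbuf[i] on this rank holds the block that rank i
// prepared for us: the block matrix has been transposed.
MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);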

Global computation routines - Reduce
D(i,j) is the jth data item in process i.
In the root process: D(j) = D(0,j) ⊛ D(1,j) ⊛ … ⊛ D(n-1,j)
The operator must be associative: (a ⊛ b) ⊛ c = a ⊛ (b ⊛ c); associativity allows a tree-shaped computation.
The MPI predefined operators are also commutative: a ⊛ b = b ⊛ a; commutativity allows further optimization (e.g., combine data as soon as it is ready, with no need to keep partial results in order).
int MPI_Reduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, int root, MPI_Comm comm)
int MPI_Allreduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
int MPI_Reduce_scatter(void* sbuf, void* rbuf, int* rcounts, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
rcounts indicates the number of result elements distributed to each process (a vector); it must be identical on all calling processes.
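A typical sketch: each process computes a partial result and the root combines them with MPI_SUM (x, y and local_n stand for illustrative local data).

double local_sum = 0.0, global_sum = 0.0;
for (int i = 0; i < local_n; i++)            // local_n: size of this rank's slice
    local_sum += x[i] * y[i];

// Rank 0 receives the sum of all the local_sum values.
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

// MPI_Allreduce would instead leave global_sum defined on every rank.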

Scan
d(k,j) - the jth data item in process k before the scan.
D(k,j) - the jth data item in process k after returning from the scan.
For process k: D(k,j) = d(0,j) ⊛ d(1,j) ⊛ … ⊛ d(k,j)
int MPI_Scan(void* sbuf, void* rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
Where could we have used it before?
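One common use, sketched below (local_count is an illustrative per-rank item count): when each process owns a variable number of items, a prefix sum over the counts gives each process the starting offset of its items in a global ordering.

int inclusive = 0;

// inclusive = local_count of rank 0 + ... + local_count of this rank
MPI_Scan(&local_count, &inclusive, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

// Exclusive prefix: where this rank's items start globally.
int my_offset = inclusive - local_count;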

Predefined reduce operations

Name         Meaning                  C type
MPI_MAX      maximum                  integer, float
MPI_MIN      minimum                  integer, float
MPI_SUM      sum                      integer, float
MPI_PROD     product                  integer, float
MPI_LAND     logical and              integer
MPI_BAND     bit-wise and             integer, MPI_BYTE
MPI_LOR      logical or               integer
MPI_BOR      bit-wise or              integer, MPI_BYTE
MPI_LXOR     logical xor              integer
MPI_BXOR     bit-wise xor             integer, MPI_BYTE
MPI_MAXLOC   max value and location   combinations of int, float, double, and long double
MPI_MINLOC   min value and location   combinations of int, float, double, and long double

MPI_MAXLOC and MPI_MINLOC allow the use of pair types such as MPI_FLOAT_INT, MPI_DOUBLE_INT, etc. The operation is applied to the first element (e.g., the float in MPI_FLOAT_INT), but the second element is also returned; this carries extra information about the item (e.g., its index).
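A sketch of MPI_MAXLOC with the predefined pair type MPI_DOUBLE_INT: each process proposes its local maximum together with its own rank, and the root learns both the global maximum and a rank that attains it (local_max is illustrative).

struct { double value; int rank; } in, out;

in.value = local_max;                        // this rank's candidate value
in.rank  = rank;                             // the "location" part of the pair

// out.value = global maximum; out.rank = a rank that holds it (valid at the root).
MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);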

User-defined operations

typedef void MPI_User_function(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype);
The function must compute: inoutvec[i] ⟵ invec[i] ⊛ inoutvec[i], for i ∈ [0 … len-1].
No MPI communication function may be called inside the user function; MPI_ABORT may be called inside the function in case of an error.
MPI_Op_create(MPI_User_function *function, int commute, MPI_Op *op)
Should be called on all processes.
A user-defined operation is deallocated with: int MPI_Op_free(MPI_Op *op)
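A sketch of a user-defined operation, an element-wise maximum of absolute values (a made-up example; the operation is associative and commutative, so commute is set to 1):

#include <math.h>

// inoutvec[i] <- max(|invec[i]|, |inoutvec[i]|), assuming MPI_DOUBLE data
void maxabs(void* invec, void* inoutvec, int* len, MPI_Datatype* datatype)
{
    double* in    = (double*)invec;
    double* inout = (double*)inoutvec;
    for (int i = 0; i < *len; i++) {
        double a = fabs(in[i]);
        double b = fabs(inout[i]);
        inout[i] = (a > b) ? a : b;
    }
}

// On every process:
MPI_Op op;
MPI_Op_create(maxabs, 1 /* commute */, &op);

double local[4] = { -1.5, 2.0, -3.25, 0.5 }, result[4];
MPI_Reduce(local, result, 4, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);

MPI_Op_free(&op);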

Performance issues
A great deal of hidden communication takes place with collective communication.
Performance depends greatly on the particular MPI implementation.
Because there may be forced synchronization, collective communication is not always the best choice.