1 Collective Communications

2 Overview

- All processes in a group participate in the communication by calling the same function with matching arguments.
- Types of collective operations:
  - Synchronization: MPI_Barrier
  - Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
  - Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
- Collective routines are blocking:
  - Completion of the call means the communication buffer can be accessed.
  - There is no indication of other processes' completion status.
  - A collective call may or may not have a synchronizing effect among processes.

3 Overview

- Collective communication can use the same communicators as point-to-point (PtP) communication.
- MPI guarantees that messages from collective communication will not be confused with PtP messages.
- The key is the group of processes participating in the communication.
- If you want only a subset of processes involved in a collective communication, you need to create a sub-group/sub-communicator from MPI_COMM_WORLD (a sketch follows below).
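A minimal sketch of creating a sub-communicator with MPI_Comm_split and using it in a collective call; the splitting rule (first half vs. second half of the ranks) is only an illustrative assumption:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // color 0 = first half of the ranks, color 1 = second half
    int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    // this broadcast involves only the processes that share sub_comm
    int num = -1, sub_rank;
    MPI_Comm_rank(sub_comm, &sub_rank);
    if (sub_rank == 0) num = 100;
    MPI_Bcast(&num, 1, MPI_INT, 0, sub_comm);

    printf("world rank %d, sub rank %d, num %d\n", world_rank, sub_rank, num);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}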

4 Barrier

- Blocks the calling process until all group members have called it.
- Barriers hurt performance; refrain from using them unless necessary.

int MPI_Barrier(MPI_Comm comm)

MPI_BARRIER(COMM, IERROR)
  integer COMM, IERROR

…
MPI_Barrier(MPI_COMM_WORLD);   // synchronization point
…

5 Broadcast

- Broadcasts a message from the process with rank root to all processes in the group, including itself.
- comm and root must be the same on all processes.
- The amount of data sent must equal the amount of data received, pairwise between each process and the root.
- For now, this means count and datatype must be the same on all processes; they may differ once generalized (derived) datatypes are involved.

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
  BUFFER(*)
  integer COUNT, DATATYPE, ROOT, COMM, IERROR

int num = -1;
if (my_rank == 0) num = 100;
…
MPI_Bcast(&num, 1, MPI_INT, 0, MPI_COMM_WORLD);
…

6 Gather

- Gathers messages to root; they are concatenated in rank order at the root process.
- recvbuf, recvcount, recvtype are significant only at root; they are ignored on the other processes.
- root and comm must be identical on all processes.
- recvbuf and sendbuf cannot be the same buffer on the root process.
- The amount of data sent from each process must equal the amount of data received at root. For now, recvcount = sendcount and recvtype = sendtype.
- recvcount is the number of items received from each process, not the total number of items received, and not the size of the receive buffer!

int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
  SENDBUF(*), RECVBUF(*)
  integer SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

7 Gather Example

int rank, ncpus;
int root = 0;
int *data_received = NULL, data_send[100];

// assume running with 10 cpus
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &ncpus);
if (rank == root) data_received = new int[100*ncpus];   // 100*10

MPI_Gather(data_send, 100, MPI_INT, data_received, 100, MPI_INT, root, MPI_COMM_WORLD);   // ok

// MPI_Gather(data_send, 100, MPI_INT, data_received, 100*ncpus, MPI_INT, root, MPI_COMM_WORLD);   // wrong

8 Gather to All

- The messages, concatenated in rank order, are received by all processes.
- recvcount is the number of items from each process, not the total number of items received.
- For now, sendcount = recvcount and sendtype = recvtype.

int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

int A[100], B[1000];
// assume 10 processors
MPI_Allgather(A, 100, MPI_INT, B, 100, MPI_INT, MPI_COMM_WORLD);    // ok?
...
MPI_Allgather(A, 100, MPI_INT, B, 1000, MPI_INT, MPI_COMM_WORLD);   // ok?

9 Scatter

- The inverse of MPI_Gather.
- Splits the message into ncpus equal segments; the n-th segment goes to the n-th process.
- sendbuf, sendcount, sendtype are significant only at root; they are ignored on the other processes.
- sendcount is the number of items sent to each process, not the total number of items in sendbuf.

int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

10 Scatter Example

int A[1000], B[100];
...   // initialize A etc.
// assume 10 processors
MPI_Scatter(A, 100, MPI_INT, B, 100, MPI_INT, 0, MPI_COMM_WORLD);    // ok?
...
MPI_Scatter(A, 1000, MPI_INT, B, 100, MPI_INT, 0, MPI_COMM_WORLD);   // ok?

11 All-to-All

- Important for distributed matrix transposition; critical for FFT-based algorithms.
- The most communication-intensive (stressful) collective.
- sendcount is the number of items sent to each process, not the total number of items in sendbuf.
- recvcount is the number of items received from each process, not the total number of items received.
- For now, sendcount = recvcount and sendtype = recvtype.

int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

12 All-to-All Example

double A[4], B[4];
...
// assume 4 cpus
for (i = 0; i < 4; i++) A[i] = my_rank + i;

MPI_Alltoall(A, 4, MPI_DOUBLE, B, 4, MPI_DOUBLE, MPI_COMM_WORLD);   // ok?
MPI_Alltoall(A, 1, MPI_DOUBLE, B, 1, MPI_DOUBLE, MPI_COMM_WORLD);   // ok?

[Figure: per-process send buffers on CPU 0 through CPU 3]
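For reference, a short walk-through of the second call (not on the original slide), assuming 4 processes and A[i] = my_rank + i:

// after MPI_Alltoall(A, 1, MPI_DOUBLE, B, 1, MPI_DOUBLE, MPI_COMM_WORLD):
// process r sends A[i] = r + i to process i and receives i + r from process i,
// so on process r:  B[i] = i + r, e.g. on CPU 2:  B = {2, 3, 4, 5}.
// (The first call, with count 4, would need a send buffer of 4*ncpus doubles,
//  since it sends 4 items to every process.)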

13 Reduction

- Performs global reduction operations (sum, max, min, logical AND, etc.) across processes:
  - MPI_Reduce – returns the result to one process
  - MPI_Allreduce – returns the result to all processes
  - MPI_Reduce_scatter – scatters the reduction result across processes
  - MPI_Scan – parallel prefix operation

14 Reduction

- Element-wise, combines data from the input buffers across processes using operation op; stores the result in the output buffer on process root.
- All processes must provide input/output buffers of the same length and datatype.
- Operation op must be associative:
  - pre-defined operations, or
  - user-defined operations (a sketch follows below).

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

int rank, res;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Reduce(&rank, &res, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
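A minimal sketch of a user-defined reduction operation via MPI_Op_create; the element-wise product chosen here and the names my_prod/myop are only illustrative:

#include <mpi.h>
#include <stdio.h>

// the user function must have this prototype; here: element-wise integer product
void my_prod(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i;
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (i = 0; i < *len; i++)
        inout[i] = in[i] * inout[i];
}

int main(int argc, char **argv)
{
    int rank, res;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op myop;
    MPI_Op_create(my_prod, 1, &myop);   // 1 = the operation is commutative

    int val = rank + 1;                 // reduces to 1*2*...*ncpus on root
    MPI_Reduce(&val, &res, 1, MPI_INT, myop, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("product of (rank+1): %d\n", res);

    MPI_Op_free(&myop);
    MPI_Finalize();
    return 0;
}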

15 Pre-Defined Operations

MPI_MAX      maximum
MPI_MIN      minimum
MPI_SUM      sum
MPI_PROD     product
MPI_LAND     logical AND
MPI_LOR      logical OR
MPI_BAND     bitwise AND
MPI_BOR      bitwise OR
MPI_LXOR     logical XOR
MPI_BXOR     bitwise XOR
MPI_MAXLOC   max value and location
MPI_MINLOC   min value and location
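MPI_MAXLOC/MPI_MINLOC reduce value-and-index pairs; a minimal sketch using the pre-defined MPI_DOUBLE_INT pair type (local_max is an assumed, illustrative variable):

struct { double value; int index; } in, out;

in.value = local_max;   // each process's local maximum, assumed computed earlier
in.index = my_rank;
MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
// on root: out.value is the global maximum, out.index the rank that owns it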

16 All Reduce

- The reduction result is stored on all processors.

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

int rank, res;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Allreduce(&rank, &res, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

17 Scan

- Prefix reduction: process j receives the result of the reduction over the input buffers of processes 0, 1, …, j (see the sketch below).

int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
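A minimal sketch in the style of the reduce examples above: each process contributes its rank, so process j ends up with 0 + 1 + … + j.

int rank, prefix_sum;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Scan(&rank, &prefix_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
// e.g. on process 3: prefix_sum = 0 + 1 + 2 + 3 = 6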

18 Example: Matrix Transpose

- A: N x N matrix, distributed over P cpus, row-wise decomposition.
- B = A^T: also distributed over P cpus, row-wise decomposition.
- A_ij: (N/P) x (N/P) blocks, with B_ij = A_ji^T.
- Input: A[i][j] = 2*i + j.

[Figure: A partitioned into blocks A_11 … A_33; a local transpose of each block followed by an all-to-all exchange yields B with B_ij = A_ji^T.]

19 Example: Matrix Transpose

- On each cpu, A is an (N/P) x N matrix; it first needs to be re-written as P blocks of (N/P) x (N/P) matrices, then each block can be transposed locally.
  [Figure: a local 2 x 4 matrix A viewed as two 2 x 2 blocks]
- After the all-to-all communication, each cpu has P blocks of (N/P) x (N/P) matrices, which need to be merged into an (N/P) x N matrix.
- Four steps:
  1. Divide A into blocks;
  2. Transpose each block locally;
  3. All-to-all communication;
  4. Merge blocks locally.

20 Matrix Transposition

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include "dmath.h"

#define DIM 1000   // global matrices A, B are DIM x DIM

int main(int argc, char **argv)
{
    int ncpus, my_rank, i, j, iblock;
    int Nx, Ny;   // Nx = DIM/ncpus, Ny = DIM; local arrays: A[Nx][Ny], B[Nx][Ny]
    double **A, **B, *Ctmp, *Dtmp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    if (DIM % ncpus != 0) {   // make sure DIM can be divided by ncpus
        if (my_rank == 0) printf("ERROR: DIM cannot be divided by ncpus!\n");
        MPI_Finalize();
        return -1;
    }

    Nx = DIM / ncpus;
    Ny = DIM;
    A = DMath::newD(Nx, Ny);       // allocate memory
    B = DMath::newD(Nx, Ny);
    Ctmp = DMath::newD(Nx * Ny);   // work space
    Dtmp = DMath::newD(Nx * Ny);   // work space

    for (i = 0; i < Nx; i++)
        for (j = 0; j < Ny; j++)
            A[i][j] = 2 * (my_rank * Nx + i) + j;
    memset(&B[0][0], '\0', sizeof(double) * Nx * Ny);   // zero out B

21 Matrix Transposition (continued)

    // divide A into blocks --> Ctmp;  A[i][iblock*Nx+j] --> Ctmp[iblock][i][j]
    for (i = 0; i < Nx; i++)
        for (iblock = 0; iblock < ncpus; iblock++)
            for (j = 0; j < Nx; j++)
                Ctmp[iblock*Nx*Nx + i*Nx + j] = A[i][iblock*Nx + j];

    // local transpose of A --> Dtmp;  Ctmp[iblock][i][j] --> Dtmp[iblock][j][i]
    for (iblock = 0; iblock < ncpus; iblock++)
        for (i = 0; i < Nx; i++)
            for (j = 0; j < Nx; j++)
                Dtmp[iblock*Nx*Nx + i*Nx + j] = Ctmp[iblock*Nx*Nx + j*Nx + i];

    // all-to-all communication --> Ctmp
    MPI_Alltoall(Dtmp, Nx*Nx, MPI_DOUBLE, Ctmp, Nx*Nx, MPI_DOUBLE, MPI_COMM_WORLD);

    // merge blocks --> B;  Ctmp[iblock][i][j] --> B[i][iblock*Nx+j]
    for (i = 0; i < Nx; i++)
        for (iblock = 0; iblock < ncpus; iblock++)
            for (j = 0; j < Nx; j++)
                B[i][iblock*Nx + j] = Ctmp[iblock*Nx*Nx + i*Nx + j];

    // clean up
    DMath::del(A);
    DMath::del(B);
    DMath::del(Ctmp);
    DMath::del(Dtmp);
    MPI_Finalize();
    return 0;
}
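A quick correctness check one might insert just before the clean-up (not on the original slide): since the global matrix is initialized as A[r][c] = 2*r + c, the transpose must satisfy B[r][c] = 2*c + r, i.e. locally B[i][j] = 2*j + (my_rank*Nx + i).

    // verify the transpose (local row i of B is global row my_rank*Nx + i)
    int errors = 0;
    for (i = 0; i < Nx; i++)
        for (j = 0; j < Ny; j++)
            if (B[i][j] != 2.0*j + (my_rank*Nx + i)) errors++;
    if (errors > 0) printf("rank %d: %d mismatches\n", my_rank, errors);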

22 Project #1: FFT of 3D Matrix

- A: 3D matrix of real numbers, N x N x N.
- Distributed over P CPUs:
  - 1D decomposition: x direction in C, z direction in FORTRAN;
  - (bonus) 2D decomposition: x and y directions in C, or y and z directions in FORTRAN.
- Compute the 3D FFT of this matrix using the fftw library.

[Figure: 1D decomposition of the N x N x N array into slabs of size (N/P) x N x N along the x (C) or z (FORTRAN) direction.]

23 Project #1

- The FFTW library will be available on the ITAP machines.
- The fftw user's manual is available online; refer to the manual for how to use the fftw functions.
- FFTW is serial.
  - It has an MPI-parallel version (fftw 2.1.5), suitable for 1D decomposition.
  - You cannot use fftw's MPI routines for this project.
- The 3D fft can be done in several steps, e.g.:
  - first a real-to-complex fft in the z direction,
  - then a complex fft in the y direction,
  - then a complex fft in the x direction.
- When doing the fft in a direction, e.g. the x direction, if the matrix is distributed/decomposed in that direction, you need to:
  - first do a matrix transposition to get all the data along that direction,
  - then call the fftw function to perform the fft along that direction,
  - then you may/will need to transpose the matrix back.
- (A sketch of a per-direction fftw call follows below.)
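A minimal sketch of one 1D real-to-complex transform along a single local pencil, assuming the FFTW 3 interface (the course machines may provide fftw 2.1.5, whose API differs; the function name fft_z_pencil and the plan-per-call style are illustrative — in practice you would create one plan and reuse it for all pencils, e.g. via fftw_plan_many_dft_r2c):

#include <fftw3.h>

// transform one pencil of N real values into N/2+1 complex coefficients
void fft_z_pencil(double *in, fftw_complex *out, int N)
{
    fftw_plan p = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
}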

24 Project #1

- Write a parallel C, C++, or FORTRAN program that first computes the fft of matrix A and stores the result in matrix B, then computes the inverse fft of B and stores the result in C. Check the correctness of your code by comparing the data in A and C. Make sure your program is correct by testing with some small matrices, e.g. a 4x4x4 matrix.
- If you want the bonus points, you can implement only the 2D data decomposition; letting the number of cpus in one direction be 1 then also handles the 1D data decomposition.
- Let A be a matrix of size 256x256x256, with A[i][j][k] = 3*i + 2*j + k.
- Run your code on 1, 2, 4, 8, 16 processors and record the wall-clock time of the main code section (transpositions, ffts, inverse ffts, etc.) using MPI_Wtime(), as in the sketch below.
- Compute the speedup factors Sp = T1/Tp.
- Turn in:
  - your source code + a compiled binary on hamlet or radon,
  - a plot of speedup vs. number of CPUs for each data decomposition,
  - a write-up of what you have learned from this project.
- Due: 10/30
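A minimal timing sketch with MPI_Wtime; the surrounding barriers are an assumption so that all processes time the same span of work:

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    // ... transpositions, forward ffts, inverse ffts ...
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    if (my_rank == 0) printf("elapsed time: %g seconds\n", t1 - t0);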

25 [Figure: data decomposition diagram with sub-array dimensions N, N/P, N]