Collective Communication
University of North Carolina - Chapel Hill
ITS - Research Computing
Instructor: Mark Reed
Collective Communication
- Communications involving a group of processes.
- Called by all processes in a communicator.
- Examples: barrier synchronization; broadcast, scatter, gather; global sum, global maximum, etc.
Characteristics of Collective Communication
- Collective action over a communicator: all processes must communicate.
- Synchronization may or may not occur.
- All collective operations are blocking.
- No message tags.
- Receive buffers must be exactly the right size; this restriction was made to simplify the standard (e.g. no receive status is required).
Why use collective operations?
- Efficiency
- Clarity
- Convenience
- Robustness
- Flexibility
- Programming style - all collective ops
Barrier Synchronization
int MPI_Barrier(MPI_Comm comm)
- All processes within the communicator, comm, are synchronized: no process returns from the call until every process has entered it.
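One common use is lining processes up before timing a region, so that the measurement reflects the slowest process rather than whichever one arrived first. A minimal sketch, assuming the work being timed goes where the comment is:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank;
  double t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);   /* everyone starts the clock together */
  t0 = MPI_Wtime();

  /* ... the work being timed would go here ... */

  MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process */
  t1 = MPI_Wtime();

  if (rank == 0) printf("elapsed time: %f seconds\n", t1 - t0);

  MPI_Finalize();
  return 0;
}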
Broadcast
(Diagram: before the call only the root holds the data item A0; afterwards every process holds a copy of A0.)
Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- buffer - address of data to broadcast
- count - number of items to send
- datatype - MPI datatype
- root - process from which data is sent
- comm - MPI communicator
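As a sketch of how the arguments fit together: the root (rank 0 here) sets a value and broadcasts it so every rank ends up with a copy. Note that every rank calls MPI_Bcast with the same root:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, n = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) n = 100;    /* only the root has the value initially */

  /* after the call, n on every rank holds the root's value */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("rank %d: n = %d\n", rank, n);

  MPI_Finalize();
  return 0;
}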
Gather/Scatter
(Diagram: scatter distributes A0, A1, A2, A3 from the root so that process i receives Ai; gather is the inverse, collecting the Ai from all processes back into the root's buffer.)
Gather
int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- sendbuf - each process (including the root) sends this buffer to the root
- sendcount - number of elements in the send buffer
- recvbuf - address of the receive buffer; ignored on all non-root processes. The gather buffer should be large enough to hold the results from all processors.
- recvcount - number of elements for any single receive
- root - rank of the receiving process
Scatter
int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- sendbuf - address of the send buffer; ignored on all non-root processes
- sendcount - number of elements sent to each process
- recvbuf - address of the receive buffer
- recvcount - number of elements in the receive buffer
- root - rank of the sending process
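A small sketch tying scatter and gather together: the root scatters one integer to each rank, each rank modifies its piece, and the root gathers the results back. The fixed buffer size assumes the job is run with exactly 4 processes; a real code would size the buffers from the communicator size.

#include <stdio.h>
#include "mpi.h"

#define NPROCS 4    /* assumed number of processes for this sketch */

int main(int argc, char *argv[])
{
  int rank, i, piece;
  int sendbuf[NPROCS], recvbuf[NPROCS];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)                       /* only the root's send buffer matters */
    for (i = 0; i < NPROCS; i++) sendbuf[i] = 10*i;

  /* rank i receives sendbuf[i] into its own variable "piece" */
  MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

  piece = piece + rank;                /* each rank works on its piece */

  /* the root collects one integer from every rank, in rank order */
  MPI_Gather(&piece, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0)
    for (i = 0; i < NPROCS; i++) printf("recvbuf[%d] = %d\n", i, recvbuf[i]);

  MPI_Finalize();
  return 0;
}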
Allgather
(Diagram: the processes start with A0, B0, C0, D0 respectively; after the call every process holds the full set A0 B0 C0 D0.)
Allgather
int MPI_Allgather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
- sendbuf - starting address of the send buffer
- sendcount - number of elements in the send buffer
- recvbuf - address of the receive buffer
- recvcount - number of elements for any single receive
- Like gather, except that all processes receive the result.
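A sketch: every rank contributes its rank number and ends up with the full list; note there is no root argument. The receive buffer is sized from the communicator, so this runs with any number of processes (allocating with malloc is just one way to do it):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, *all;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  all = (int *) malloc(nprocs * sizeof(int));

  /* each rank sends one int; every rank receives one int from every rank */
  MPI_Allgather(&rank, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

  if (rank == 0)
    for (i = 0; i < nprocs; i++) printf("all[%d] = %d\n", i, all[i]);

  free(all);
  MPI_Finalize();
  return 0;
}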
Alltoall
(Diagram: each process starts with four blocks - A0 A1 A2 A3 on the first process, B0 B1 B2 B3 on the second, and so on - and sends its j-th block to process j; the result is a transpose of the layout, so the first process ends up with A0 B0 C0 D0, the second with A1 B1 C1 D1, etc.)
All to All Scatter/Gather
int MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
- sendbuf - starting address of the send buffer
- sendcount - number of elements sent to each process
- recvbuf - address of the receive buffer
- recvcount - number of elements received from any single process
- Extension of Allgather to the case where each process sends distinct data to each receiver.
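A sketch of the transpose-like exchange from the diagram: each rank fills one slot per destination, and after the call slot j on rank i holds what rank j placed in its slot i:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, *sendbuf, *recvbuf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  sendbuf = (int *) malloc(nprocs * sizeof(int));
  recvbuf = (int *) malloc(nprocs * sizeof(int));

  /* sendbuf[j] is destined for rank j; encode sender and slot for clarity */
  for (i = 0; i < nprocs; i++) sendbuf[i] = 100*rank + i;

  /* one int goes to each rank, one int arrives from each rank */
  MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

  /* recvbuf[i] now holds 100*i + rank, i.e. the value rank i aimed at us */
  for (i = 0; i < nprocs; i++)
    printf("rank %d received %d from rank %d\n", rank, recvbuf[i], i);

  free(sendbuf); free(recvbuf);
  MPI_Finalize();
  return 0;
}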
Additional Routines
- Vector versions (the "v" variants, e.g. MPI_Gatherv, MPI_Scatterv) exist for all of these; they allow a varying count of data from each process. A minimal MPI_Gatherv sketch follows below.
- MPI_Alltoallw provides the most general form of communication and indeed generalizes all of the vector variants: it allows separate specification of count, byte displacement, and datatype for each block.
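For illustration, a minimal MPI_Gatherv sketch in which rank i contributes i+1 integers. The root supplies a per-rank count array and a displacement array saying where each contribution lands in the receive buffer; both arrays are significant only at the root.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, j, total = 0;
  int *mydata, *counts = NULL, *displs = NULL, *recvbuf = NULL;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  /* rank i sends i+1 copies of its rank number */
  mydata = (int *) malloc((rank + 1) * sizeof(int));
  for (i = 0; i <= rank; i++) mydata[i] = rank;

  if (rank == 0) {
    counts = (int *) malloc(nprocs * sizeof(int));
    displs = (int *) malloc(nprocs * sizeof(int));
    for (i = 0; i < nprocs; i++) {
      counts[i] = i + 1;        /* how many elements rank i sends   */
      displs[i] = total;        /* where they start in recvbuf      */
      total += counts[i];
    }
    recvbuf = (int *) malloc(total * sizeof(int));
  }

  MPI_Gatherv(mydata, rank + 1, MPI_INT,
              recvbuf, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (j = 0; j < total; j++) printf("%d ", recvbuf[j]);
    printf("\n");
    free(counts); free(displs); free(recvbuf);
  }

  free(mydata);
  MPI_Finalize();
  return 0;
}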
Identical send/receive buffers
- The send and receive data buffers for all collective operations are normally two distinct (disjoint) buffers.
- To reuse the same buffer, specify the MPI constant MPI_IN_PLACE instead of the send or receive buffer argument.
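Which argument MPI_IN_PLACE replaces depends on the call (the root's sendbuf for MPI_Reduce and MPI_Gather, every process's sendbuf for MPI_Allreduce, for example). A sketch with MPI_Allreduce, where the global sum overwrites each rank's own value:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, value;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  value = rank;   /* this buffer holds the input and will hold the result */

  /* MPI_IN_PLACE replaces the send buffer; "value" is reduced in place */
  MPI_Allreduce(MPI_IN_PLACE, &value, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d: global sum of ranks = %d\n", rank, value);

  MPI_Finalize();
  return 0;
}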
Global Reduction Operations
- Used to compute a result involving data distributed over a group of processes.
- Examples: global sum or product; global maximum or minimum; a global user-defined operation.
Predefined Reduction Operations
- MPI_MAX - maximum value
- MPI_MIN - minimum value
- MPI_SUM - sum
- MPI_PROD - product
- MPI_LAND, MPI_BAND - logical, bitwise AND
- MPI_LOR, MPI_BOR - logical, bitwise OR
- MPI_LXOR, MPI_BXOR - logical, bitwise exclusive OR
- MPI_MAXLOC - maximum value and its location
- MPI_MINLOC - minimum value and its location
User Defined Operators
- In addition to the predefined reduction operations, users may define their own.
- Use MPI_Op_create to register the operation with MPI.
- The operation must have the required signature:
  C: an MPI_User_function, i.e. void func(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
  Fortran: SUBROUTINE USER_FUNCTION(INVEC, INOUTVEC, LEN, TYPE)
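A sketch of registering and using a user-defined operation (a hand-written product here, simply to stand in for an operation MPI does not already provide). The function must combine len elements of invec into inoutvec:

#include <stdio.h>
#include "mpi.h"

/* user function: inoutvec[i] = invec[i] op inoutvec[i] for each element */
void my_prod(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
  int i;
  double *in = (double *) invec, *inout = (double *) inoutvec;
  for (i = 0; i < *len; i++) inout[i] = in[i] * inout[i];
}

int main(int argc, char *argv[])
{
  int rank;
  double x, result;
  MPI_Op myop;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* register the function; the 1 (true) says the operation is commutative */
  MPI_Op_create(my_prod, 1, &myop);

  x = rank + 1.0;                       /* reduce 1*2*3*...*nprocs */
  MPI_Reduce(&x, &result, 1, MPI_DOUBLE, myop, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("product = %f\n", result);

  MPI_Op_free(&myop);
  MPI_Finalize();
  return 0;
}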
Reduce
(Diagram: with a sum operation, corresponding entries from the processes are combined - A0+A1+A2, B0+B1+B2, C0+C1+C2 - and the combined vector is delivered to the root.)
Reduce
int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm)
- sendbuf - address of the send buffer
- recvbuf - address of the receive buffer
- count - number of elements in the send buffer
- op - reduce operation
- root - rank of the receiving process
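A minimal sketch: each rank contributes its own rank number and the sum lands on the root (rank 0); the receive buffer is only meaningful there:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, sum;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* the sum of all ranks ends up in "sum" on rank 0 only */
  MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("sum of ranks = %d\n", sum);

  MPI_Finalize();
  return 0;
}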
Variants of MPI_REDUCE
- MPI_ALLREDUCE - no root process; every process receives the result
- MPI_REDUCE_SCATTER - the result is scattered across the processes
- MPI_SCAN - "parallel prefix" reduction
Allreduce
(Diagram: the same combined results as Reduce - A0+A1+A2, B0+B1+B2, C0+C1+C2 - but every process receives the full result vector.)
Allreduce
int MPI_Allreduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm)
- sendbuf - address of the send buffer
- recvbuf - address of the receive buffer
- count - number of elements in the send buffer
- op - reduce operation
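A common use is a global dot product: each rank computes its local partial sum and MPI_Allreduce gives every rank the global value. A sketch with made-up local data (the vector length N and values are arbitrary):

#include <stdio.h>
#include "mpi.h"

#define N 100   /* local vector length, chosen arbitrarily for this sketch */

int main(int argc, char *argv[])
{
  int rank, i;
  double x[N], y[N], local = 0.0, global;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }   /* stand-in data */

  for (i = 0; i < N; i++) local += x[i]*y[i];           /* local piece   */

  /* every rank gets the full dot product */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d: global dot product = %f\n", rank, global);

  MPI_Finalize();
  return 0;
}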
Reduce - Scatter
(Diagram: the combined results are split up, so each process receives one piece - A0+A1+A2 on the first process, B0+B1+B2 on the second, C0+C1+C2 on the third.)
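The corresponding call is MPI_Reduce_scatter, which takes an array of receive counts (one per process) instead of a root. In this sketch each rank contributes a full vector of partial values and gets back just the one combined element it "owns":

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, *recvcounts;
  double *partial, mine;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  /* each rank contributes one partial value for every element 0..nprocs-1 */
  partial = (double *) malloc(nprocs * sizeof(double));
  for (i = 0; i < nprocs; i++) partial[i] = rank + i;

  /* every rank receives exactly one combined element */
  recvcounts = (int *) malloc(nprocs * sizeof(int));
  for (i = 0; i < nprocs; i++) recvcounts[i] = 1;

  /* element i of the global sum is delivered to rank i */
  MPI_Reduce_scatter(partial, &mine, recvcounts, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD);

  printf("rank %d owns combined element %f\n", rank, mine);

  free(partial); free(recvcounts);
  MPI_Finalize();
  return 0;
}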
Scan
(Diagram: with a sum operation, process i receives the prefix sums over processes 0..i - the first process gets A0, B0, C0; the second gets A0+A1, B0+B1, C0+C1; the third gets A0+A1+A2, B0+B1+B2, C0+C1+C2.)
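The call is MPI_Scan, whose argument list matches MPI_Allreduce. A classic use is computing each rank's offset into a global array from per-rank counts: the inclusive prefix sum minus the local count gives the starting index. A sketch with made-up counts:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, mycount, prefix, offset;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  mycount = rank + 1;            /* pretend rank i owns i+1 items */

  /* inclusive prefix sum: prefix = sum of counts on ranks 0..rank */
  MPI_Scan(&mycount, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  offset = prefix - mycount;     /* starting index of this rank's items */
  printf("rank %d: items start at global index %d\n", rank, offset);

  MPI_Finalize();
  return 0;
}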
Example: Find the maximum value of a function over some range
- Let's pick a simple function: y = cos(2πx + π/4).
- The range of 2πx will be 0 → 2π, so x will vary from 0 → 1.
- Break the x range into evenly sized blocks, one block per processor.
Example: Find the maximum value of a function
/* program findmax */
/* find the maximum of the function y = cos(2*pi*x + pi/4) across processors */
#include <stdio.h>
#include <math.h>
#include <float.h>
#include "mpi.h"
#define numpts 100

int main(int argc, char* argv[])
{
findmax cont.
  int numprocs, i, myrank;
  float pi, twopi, phase, blksze, x, delx;
  struct {
    float data;
    int idx;
  } y[numpts], ymax, ymymax;

  /* Compute number of processes and myrank */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
findmax cont.
  /* compute the block size and spacing of x points */
  blksze = 1.0/numprocs;   /* 1.0 is the max value of x */
  delx = blksze/numpts;

  /* define some constants */
  pi = acos(-1.0);
  phase = pi/4;
  twopi = 2*pi;

  /* note: x runs from 0 to 1-delx so that y covers a full 2*pi */
  /* initialize across all processors */
  for (i = 0; i < numpts; i++) {
    x = blksze*myrank + i*delx;
    y[i].idx = numpts*myrank + i;
    y[i].data = cos(twopi*x + phase);
  }
findmax cont.
  /* Now find the max on each local processor */
  ymymax.data = -FLT_MAX;
  ymymax.idx = 0;
  for (i = 0; i < numpts; i++) {
    if (y[i].data > ymymax.data) {
      ymymax.data = y[i].data;
      ymymax.idx = y[i].idx;
    }
  }

  /* Now find the max across the processors */
  MPI_Reduce(&ymymax, &ymax, 1, MPI_FLOAT_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
findmax cont.
  /* now print out the answer */
  if (myrank == 0) {
    printf("The maximum value of the function is %f which occurs at x = %f \n",
           ymax.data, ymax.idx*delx);
  }

  /* call barrier and exit */
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
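To try the program, the exact compile and launch commands depend on the local MPI installation; with the usual wrapper compiler and launcher a typical sequence looks something like the following (names, flags, and process counts may differ on your cluster):

mpicc -o findmax findmax.c -lm
mpirun -np 4 ./findmax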
Results
- Results for numpts = 10,000: the maximum value of the function is 1.0, which occurs at x = 0.875, i.e. at x = 7/8, or 2πx = 1¾π.
- In the previous example we found the maximum in data space and then reduced across processor space. We could have reversed this: reduce across processor space first and then find the maximum. What are the disadvantages of this second method?