Collective Communication Operations Adapted by Shirley Moore from Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, “Introduction to Parallel Computing”, Addison Wesley, 2003, Ch 4,6 UTEP CS5334/4390 March 23, 2017
Topic Overview Motivation for collective communication One-to-All Broadcast and All-to-One Reduction All-to-All Broadcast and Reduction All-Reduce and Prefix-Sum Operations Scatter and Gather All-to-All Personalized Communication
Collective Communication: Motivation Many interactions in practical parallel programs occur in well-defined patterns involving groups of processors. Efficient implementations of these operations can improve performance, reduce development effort and cost, and improve software quality. Efficient implementations leverage the underlying architecture.
Collective Communication: Assumptions Group communication operations are built using point-to-point messaging primitives. Recall that communicating a message of size m over an uncongested network takes time ts + twm. Where necessary, we can take congestion into account explicitly by scaling the tw term. We assume that the network is bidirectional and that communication is single-ported.
One-to-All Broadcast and All-to-One Reduction One processor has a piece of data (of size m) it needs to send to everyone. The dual of one-to-all broadcast is all-to-one reduction. In all-to-one reduction, each processor has m units of data. These data items must be combined piece-wise (using some associative operator, such as addition or min), and the result made available at a target processor.
One-to-All Broadcast and All-to-One Reduction One-to-all broadcast and all-to-one reduction among processors.
Broadcast and Reduction: Example Consider the problem of multiplying a matrix with a vector. The n x n matrix is assigned to an n x n (virtual) processor grid. The vector is assumed to be on the first row of processors. The first step of the product requires a one-to-all broadcast of each vector element along the corresponding column of processors. This can be done concurrently for all n columns. The processors then compute the local product of the vector element and the local matrix entry. In the final step, the results of these products are accumulated to the first column using n concurrent all-to-one reduction operations along the rows (using the sum operation).
Broadcast and Reduction: Matrix-Vector Multiplication Example One-to-all broadcast and all-to-one reduction in the multiplication of a 4 x 4 matrix with a 4 x 1 vector.
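The sketch below illustrates the example above for the simplest case of one matrix element per process. It is not taken from the slides: the use of MPI_Comm_split to build row and column communicators and the sample data are assumptions made for illustration.

```c
/* Sketch (assumed, not from the slides): y = A*x on an n x n process grid,
 * one matrix element per process. Column communicators carry the one-to-all
 * broadcast of x; row communicators carry the all-to-one sum reduction. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = (int)(sqrt((double)size) + 0.5);   /* assume size is a perfect square */
    int row = rank / n, col = rank % n;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  /* processes in the same row */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  /* processes in the same column */

    double a = rank + 1.0;              /* local matrix element A[row][col] (made-up data) */
    double x = 0.0;
    if (row == 0) x = col + 1.0;        /* vector element x[col] starts on the first row */

    /* Step 1: one-to-all broadcast of x[col] down each column. */
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, col_comm);

    /* Step 2: local product, then all-to-one sum reduction to the first column. */
    double prod = a * x, y = 0.0;
    MPI_Reduce(&prod, &y, 1, MPI_DOUBLE, MPI_SUM, 0, row_comm);

    if (col == 0) printf("process %d holds y[%d] = %f\n", rank, row, y);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```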
Question for Thought: Communication Cost? What is the communication cost of an efficient implementation of one-to-all broadcast (all-to-one reduction)? Ignore network congestion.
Cost Analysis The broadcast or reduction procedure involves log p point-to-point message steps, each at a time cost of ts + twm. The total parallel communication time is therefore T = (ts + twm) log p.
MPI Collective Communication Operations The barrier synchronization operation is performed in MPI using: int MPI_Barrier(MPI_Comm comm) The one-to-all broadcast operation is: int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int source, MPI_Comm comm) The all-to-one reduction operation is: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm) See the Open MPI man pages for details: https://www.open-mpi.org/doc/v2.0/
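A minimal usage sketch of MPI_Bcast and MPI_Reduce on MPI_COMM_WORLD follows; the broadcast parameter n and the reduced values (the process ranks) are made up for illustration.

```c
/* Minimal sketch (assumed example): the root broadcasts an integer parameter,
 * every process contributes one value, and the sum is reduced back to the root. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = (rank == 0) ? 100 : 0;                 /* only the root knows n initially */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* now every process has n = 100 */

    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("n = %d, sum of ranks = %d\n", n, total);
    MPI_Finalize();
    return 0;
}
```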
Predefined MPI Reduction Operations MPI_MAX (maximum), MPI_MIN (minimum), MPI_SUM (sum), MPI_PROD (product), MPI_LAND (logical AND), MPI_BAND (bit-wise AND), MPI_LOR (logical OR), MPI_BOR (bit-wise OR), MPI_LXOR (logical exclusive OR), MPI_BXOR (bit-wise exclusive OR), MPI_MAXLOC (maximum value and its location), MPI_MINLOC (minimum value and its location).
Class Exercise: MPI_Reduce Modify the Monte Carlo pi program from the Introduction to MPI class to use MPI_Reduce to aggregate the results.
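One possible solution sketch follows. It assumes a generic Monte Carlo estimator (not necessarily the program from the Introduction to MPI class): each process counts its own hits inside the unit quarter circle, and MPI_Reduce sums the counts at the root.

```c
/* Sketch: Monte Carlo estimate of pi, aggregated with MPI_Reduce.
 * Trial count, seeding, and random-number choice are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local_trials = 1000000, local_hits = 0;
    srand(rank + 1);                              /* different seed per process */
    for (long i = 0; i < local_trials; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) local_hits++;   /* point falls inside the quarter circle */
    }

    long total_hits = 0;
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %f\n",
               4.0 * (double)total_hits / (double)(local_trials * size));
    MPI_Finalize();
    return 0;
}
```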
All-to-All Broadcast and Reduction Generalization of broadcast in which each processor is the source as well as destination. A process sends the same m-word message to every other process, but different processes may broadcast different messages.
All-to-All Broadcast and Reduction All-to-all broadcast and all-to-all reduction.
MPI Collective Operations MPI provides the MPI_Allgather function, in which the data are gathered at all the processes – i.e., an all-to-all broadcast: int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm) MPI also provides MPI_Reduce_scatter, which is equivalent to a reduce followed by a scatter: int MPI_Reduce_scatter(const void *sendbuf, void *recvbuf, const int recvcounts[], MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
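A minimal MPI_Allgather sketch, with made-up data: each process contributes its rank, and every process ends up with the full array of ranks, which is the all-to-all broadcast pattern described above.

```c
/* Sketch (assumed example): all-to-all broadcast of one integer per process. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank;                               /* each process's one-word message */
    int *all = malloc(size * sizeof(int));
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    /* Every process now holds all[i] == i for i = 0..size-1. */
    if (rank == 0)
        for (int i = 0; i < size; i++) printf("all[%d] = %d\n", i, all[i]);

    free(all);
    MPI_Finalize();
    return 0;
}
```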
All-Reduce Operation In all-reduce, each node starts with a buffer of size m and the final results of the operation are identical buffers of size m on each node that are formed by combining the original p buffers using an associative operator. Identical to all-to-one reduction followed by a one-to-all broadcast (but this implementation is not the most efficient). Different from all-to-all reduction, in which p simultaneous all-to-one reductions take place, each with a different destination for the result.
The Prefix-Sum Operation Given p numbers n0, n1, …, np-1 (one on each node), the problem is to compute the sums sk = ∑i=0..k ni for all k between 0 and p-1. Initially, nk resides on the node labeled k, and at the end of the procedure, the same node holds sk.
MPI Collective Communication Operations If the result of the reduction operation is needed by all processes, MPI provides: int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) To compute prefix-sums, MPI provides: int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
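A minimal sketch of both calls, with made-up data: MPI_Allreduce leaves the global sum on every process, and MPI_Scan leaves on process k the prefix sum sk of the values held by processes 0 through k.

```c
/* Sketch (assumed example): all-reduce and inclusive prefix sum of one integer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = rank + 1;                  /* process k contributes n_k = k + 1 */
    int total = 0, prefix = 0;

    MPI_Allreduce(&n, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* sum on every process */
    MPI_Scan(&n, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);      /* s_k on process k */

    printf("rank %d: total = %d, prefix sum s_%d = %d\n", rank, total, rank, prefix);
    MPI_Finalize();
    return 0;
}
```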
Scatter and Gather In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication). In the gather operation, a single node collects a unique message from each node. While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast). The gather operation is exactly the inverse of the scatter operation and can be executed as such.
Gather and Scatter Operations Scatter and gather operations.
Cost of Scatter and Gather There are log p steps; in each step, the machine size halves and the data size halves. Thus the time for this operation is T = ts log p + twm(p - 1).
MPI Collective Operations The gather operation is performed in MPI using: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int target, MPI_Comm comm) The corresponding scatter operation is: int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, MPI_Comm comm)
All-to-All Personalized Communication Each node has a distinct message of size m for every other node. This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes. All-to-all personalized communication is also known as total exchange.
All-to-All Personalized Communication
All-to-All Personalized Communication: Example Consider the problem of transposing a matrix. Each processor contains one full row of the matrix. The transpose operation in this case is identical to an all-to-all personalized communication operation.
All-to-All Personalized Communication: Example All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.
MPI Collective Operation The all-to-all personalized communication operation is performed by: int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)
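The sketch below applies MPI_Alltoall to the transpose example above. It assumes p = n processes, one full row of an n x n matrix per process, and exactly one element exchanged between every pair of processes; the matrix entries are made up.

```c
/* Sketch: matrix transpose as an all-to-all personalized communication.
 * After MPI_Alltoall, process i holds column i of A, i.e. row i of A^T. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, n;                        /* n = number of processes = matrix order */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    double *row  = malloc(n * sizeof(double));   /* row `rank` of A */
    double *trow = malloc(n * sizeof(double));   /* row `rank` of A^T */
    for (int j = 0; j < n; j++)
        row[j] = rank * n + j;                   /* A[rank][j], made-up data */

    /* Element j of the local row goes to process j; the element received
     * from process i becomes A^T[rank][i] = A[i][rank]. */
    MPI_Alltoall(row, 1, MPI_DOUBLE, trow, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d now holds row %d of the transpose\n", rank, rank);
    free(row);
    free(trow);
    MPI_Finalize();
    return 0;
}
```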
Class Exercise Write an MPI program called rowsum to scatter the rows of a matrix from the root to all the processes, have each process compute the sum of its row, and gather the row sums back to the root process.
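One possible solution sketch for the rowsum exercise, assuming the number of matrix rows equals the number of processes and that the matrix is stored row-major on the root; the matrix values are made up.

```c
/* Sketch: scatter rows from the root, sum each row locally, gather the sums. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = p;                                   /* p x n matrix, one row per process */
    double *A = NULL, *sums = NULL;
    if (rank == 0) {
        A = malloc(p * n * sizeof(double));      /* full matrix only on the root */
        sums = malloc(p * sizeof(double));
        for (int i = 0; i < p * n; i++) A[i] = i;   /* made-up data */
    }

    double *myrow = malloc(n * sizeof(double));
    MPI_Scatter(A, n, MPI_DOUBLE, myrow, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double mysum = 0.0;                          /* sum of this process's row */
    for (int j = 0; j < n; j++) mysum += myrow[j];

    MPI_Gather(&mysum, 1, MPI_DOUBLE, sums, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < p; i++) printf("row %d sum = %f\n", i, sums[i]);

    free(myrow);
    if (rank == 0) { free(A); free(sums); }
    MPI_Finalize();
    return 0;
}
```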