Sahalu Junaidu ICS 573: High Performance Computing

8.1 Topic Overview
Matrix-Matrix Multiplication
Block Matrix Operations
A Simple Parallel Matrix-Matrix Multiplication
Cannon's Algorithm
Overlapping Communication with Computation

8.2 Matrix-Matrix Multiplication
Building on our matrix-vector multiplication (Quinn's Chapter 8), we now consider matrix-matrix multiplication:
–multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.
For simplicity, we use the following serial algorithm:
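A minimal C sketch of the standard triple-loop serial algorithm referred to here (not necessarily the exact listing used in the course), assuming n x n matrices stored row-major in one-dimensional arrays:

/* Standard O(n^3) serial algorithm: C = A x B for n x n row-major matrices. */
void matmul_serial(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}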

8.3 Block Matrix Operations
Matrix computations involving scalar algebraic operations on the matrix elements can be expressed in terms of identical operations on submatrices of the original matrix.
Such algebraic operations on the submatrices are called block matrix operations.
–useful in matrix multiplication as well as in a variety of other matrix algorithms
In this view, an n x n matrix A can be regarded as a q x q array of blocks A_i,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.
We perform q^3 block matrix multiplications, each involving (n/q) x (n/q) matrices.
–each block multiplication requires about (n/q)^3 additions and multiplications
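As an illustration (not from the original slides), the same product can be computed block by block. The sketch below assumes n is divisible by q and row-major storage; the function name is a placeholder:

/* Block formulation: C_i,j = sum over k of A_i,k x B_k,j, with q x q blocks
 * of size b = n/q each (n assumed divisible by q). Row-major storage. */
void matmul_blocked(int n, int q, const double *A, const double *B, double *C)
{
    int b = n / q;                         /* block dimension */
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (int bi = 0; bi < q; bi++)         /* block row of C */
        for (int bj = 0; bj < q; bj++)     /* block column of C */
            for (int bk = 0; bk < q; bk++) /* q^3 block multiplications in total */
                for (int i = bi*b; i < (bi+1)*b; i++)
                    for (int j = bj*b; j < (bj+1)*b; j++) {
                        double sum = 0.0;
                        for (int k = bk*b; k < (bk+1)*b; k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] += sum;
                    }
}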

8.4 Block Matrix Operations

8.5 A Simple Parallel Matrix-Matrix Multiplication Algorithm
Consider two n x n matrices A and B partitioned into p blocks A_i,j and B_i,j (0 ≤ i, j < √p), each of size (n/√p) x (n/√p).
Process P_i,j initially stores A_i,j and B_i,j and computes block C_i,j of the result matrix.
Computing submatrix C_i,j requires all submatrices A_i,k and B_k,j for 0 ≤ k < √p.
All-to-all broadcast blocks of A along rows and blocks of B along columns.
Perform the local submatrix multiplications and add the partial results (see the sketch below).
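For concreteness, here is a hedged C/MPI sketch of this scheme. It assumes a q x q (q = √p) 2-D Cartesian communicator grid_comm, b x b local blocks myA, myB, myC stored row-major, and row/column sub-communicators built with MPI_Cart_sub; the names are illustrative, not the course's program.

#include <mpi.h>
#include <stdlib.h>

void simple_parallel_matmul(int b, double *myA, double *myB, double *myC,
                            MPI_Comm grid_comm)
{
    int dims[2], periods[2], coords[2];
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);
    int q = dims[0];                       /* process grid assumed square: q x q */

    /* Row and column sub-communicators of the 2-D Cartesian grid. */
    MPI_Comm row_comm, col_comm;
    int keep_row[2] = {0, 1};              /* vary along dim 1 -> processes in my row */
    int keep_col[2] = {1, 0};              /* vary along dim 0 -> processes in my column */
    MPI_Cart_sub(grid_comm, keep_row, &row_comm);
    MPI_Cart_sub(grid_comm, keep_col, &col_comm);

    /* All-to-all broadcast: gather all A blocks of my row and all B blocks of my column. */
    double *allA = malloc((size_t)q * b * b * sizeof(double));
    double *allB = malloc((size_t)q * b * b * sizeof(double));
    MPI_Allgather(myA, b*b, MPI_DOUBLE, allA, b*b, MPI_DOUBLE, row_comm);
    MPI_Allgather(myB, b*b, MPI_DOUBLE, allB, b*b, MPI_DOUBLE, col_comm);

    /* C_i,j = sum over k of A_i,k x B_k,j; every block is stored row-major. */
    for (int i = 0; i < b*b; i++) myC[i] = 0.0;
    for (int k = 0; k < q; k++) {
        const double *Ak = allA + (size_t)k * b * b;   /* A_i,k */
        const double *Bk = allB + (size_t)k * b * b;   /* B_k,j */
        for (int i = 0; i < b; i++)
            for (int j = 0; j < b; j++) {
                double sum = 0.0;
                for (int t = 0; t < b; t++)
                    sum += Ak[i*b + t] * Bk[t*b + j];
                myC[i*b + j] += sum;
            }
    }
    free(allA); free(allB);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
}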

8.6 Matrix-Matrix Multiplication: Performance Analysis
The two all-to-all broadcasts take time 2(t_s log(√p) + t_w (n^2/p)(√p − 1)).
The computation requires √p multiplications of (n/√p) x (n/√p) submatrices, taking n^3/p time.
The parallel run time is approximately
T_P = n^3/p + t_s log p + 2 t_w n^2/√p.

8.7 Drawback of the Simple Parallel Algorithm
A major drawback of this algorithm is that it is not memory optimal.
Each process holds √p blocks of both matrices A and B at the end of each communication phase.
Thus, each process requires Θ(n^2/√p) memory
–since each block requires Θ(n^2/p) memory.
The total memory requirement over all the processes is Θ(n^2 √p), i.e., √p times the memory requirement of the sequential algorithm.

8.8 Matrix-Matrix Multiplication: Cannon's Algorithm
Cannon's algorithm is a memory-efficient version of the simple parallel algorithm
–with a total memory requirement of Θ(n^2).
Matrices A and B are partitioned into p square blocks as in the simple parallel algorithm.
Although every process in the i-th row requires all √p submatrices A_i,k, the all-to-all broadcast can be avoided by
–scheduling the computations of the processes of the i-th row such that, at any given time, each process is using a different block A_i,k.
–systematically rotating these blocks among the processes after every submatrix multiplication so that every process gets a fresh A_i,k after each rotation.
If an identical schedule is applied to the columns of B, then no process holds more than one block of each matrix at any time.

8.9 Communication Steps in Cannon's Algorithm

8.10 Communication Steps in Cannon's Algorithm
First, align the blocks of A and B in such a way that each process can multiply its local submatrices:
–shift submatrices A_i,j to the left (with wraparound) by i steps
–shift submatrices B_i,j up (with wraparound) by j steps.
After alignment (Figure 8.3c):
–process P_i,j has submatrices A_i,(i+j) mod √p and B_(i+j) mod √p,j
–perform the local block matrix multiplication.
Next, each block of A moves one step left and each block of B moves one step up (again with wraparound).
Perform the next block multiplication, add it to the partial result, and repeat until all √p blocks have been multiplied.

8.11 Cannon's Algorithm: An Example
Consider the matrices to be multiplied:
Assume that these matrices are partitioned into 4 square blocks as follows:
After the initial alignment, matrices A and B become:

8.12 Cannon's Algorithm: An Example
After this alignment, process
–P_0,0 ends up with A_0,0 and B_0,0 and should compute C_0,0
–P_0,1 ends up with A_0,1 and B_1,1 and should compute C_0,1
–P_1,0 ends up with A_1,1 and B_1,0 and should compute C_1,0
–P_1,1 ends up with A_1,0 and B_0,1 and should compute C_1,1
The local block matrix multiplications proceed as follows:

8.13 Cannon's Algorithm: An Example
Shift 1: shift each block of A one step to the left and each block of B one step up:
Next, each process P_i,j performs its block multiplication, updating C_i,j:

8.14 Cannon's Algorithm: Performance Analysis
In the alignment step, the maximum distance over which a block shifts is √p − 1,
–so the two shift operations require a total of 2(t_s + t_w n^2/p) time.
Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes t_s + t_w n^2/p time.
The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n^3/p.
The parallel time is approximately:
T_P = n^3/p + 2 √p t_s + 2 t_w n^2/√p.

8.15 MPI_Cart_shift Function
Shifting data along the dimensions of the 2-D mesh is a frequent operation in Cannon's algorithm.
–MPI provides the function MPI_Cart_shift for this purpose.
int MPI_Cart_shift(
    MPI_Comm comm_cart,  /* communicator with Cartesian structure (handle) */
    int dir,             /* coordinate dimension along which to shift */
    int s_step,          /* displacement (> 0: towards higher coordinates, < 0: towards lower) */
    int *rank_source,    /* rank of source process */
    int *rank_dest)      /* rank of destination process */
Here is an example program exercising this function (see the sketch below).
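The linked example program is not included in this text; a minimal, self-contained illustration of MPI_Cart_shift on a periodic 2-D grid (the names and the use of MPI_Dims_create are assumptions, not the linked code) might look like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, dims[2] = {0, 0}, periods[2] = {1, 1};   /* wraparound in both dimensions */
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Dims_create(p, 2, dims);                    /* e.g. 4 processes -> a 2 x 2 grid */

    MPI_Comm grid_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

    int my_rank, coords[2], left, right, up, down;
    MPI_Comm_rank(grid_comm, &my_rank);
    MPI_Cart_coords(grid_comm, my_rank, 2, coords);

    /* Shift by -1 along dimension 1 (columns): destination is my left neighbor,
     * source is my right neighbor, i.e. data flows leftward. */
    MPI_Cart_shift(grid_comm, 1, -1, &right, &left);
    /* Shift by -1 along dimension 0 (rows): data flows upward. */
    MPI_Cart_shift(grid_comm, 0, -1, &down, &up);

    printf("rank %d at (%d,%d): left=%d right=%d up=%d down=%d\n",
           my_rank, coords[0], coords[1], left, right, up, down);

    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}

With a negative displacement the returned rank_dest is the lower-coordinate neighbor and rank_source the higher-coordinate one, which is exactly the leftward/upward data flow Cannon's algorithm needs.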

8.16 Sending and Receiving Messages Simultaneously
To exchange messages, MPI provides the following function:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
    int dest, int sendtag, void *recvbuf, int recvcount,
    MPI_Datatype recvdatatype, int source, int recvtag,
    MPI_Comm comm, MPI_Status *status)
The argument list combines the arguments of the separate send and receive functions.
If we wish to use the same buffer for both send and receive, we can use:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
    int dest, int sendtag, int source, int recvtag,
    MPI_Comm comm, MPI_Status *status)
A parallel program for Cannon's algorithm is here (its core loop is sketched below).
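The linked program is not reproduced here; the following hedged sketch shows what its core (alignment followed by the compute-and-shift loop) can look like with MPI_Sendrecv_replace, assuming nb x nb row-major local blocks a, b, c and a periodic 2-D Cartesian communicator grid_comm. All names are illustrative.

#include <mpi.h>

void cannon_blocking(int nb, double *a, double *b, double *c, MPI_Comm grid_comm)
{
    int dims[2], periods[2], coords[2];
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);
    int q = dims[0];                      /* process grid assumed square: q x q */

    int left, right, up, down, src, dst;
    MPI_Cart_shift(grid_comm, 1, -1, &right, &left);   /* column dimension */
    MPI_Cart_shift(grid_comm, 0, -1, &down, &up);      /* row dimension */

    /* Initial alignment: shift A left by my row index, B up by my column index. */
    MPI_Cart_shift(grid_comm, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, nb*nb, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid_comm, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, nb*nb, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);

    for (int i = 0; i < nb*nb; i++)
        c[i] = 0.0;

    for (int step = 0; step < q; step++) {
        /* c += a x b for the blocks currently held */
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                for (int k = 0; k < nb; k++)
                    c[i*nb + j] += a[i*nb + k] * b[k*nb + j];

        /* Single-step shifts: A one block left, B one block up (with wraparound). */
        MPI_Sendrecv_replace(a, nb*nb, MPI_DOUBLE, left, 1, right, 1, grid_comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(b, nb*nb, MPI_DOUBLE, up, 1, down, 1, grid_comm, MPI_STATUS_IGNORE);
    }
    /* The inverse alignment shifts that restore the original distribution of the
     * A and B blocks are omitted from this sketch. */
}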

8.17 Overlapping Communication with Computation
Our MPI programs so far have used blocking send/receive operations to perform point-to-point communication.
As discussed earlier,
–a blocking send operation remains blocked until the message has been copied out of the send buffer
–a blocking receive operation returns only after the message has been received and copied into the receive buffer.
In Cannon's algorithm, for example, each process blocks on MPI_Sendrecv_replace
–until the specified matrix block has been sent to and received from the corresponding processes.
Note that the blocks of matrices A and B do not change as they are shifted among the processes.
–Thus, we can overlap the transmission of these blocks with the computation for the matrix-matrix multiplication.
–Many recent distributed-memory parallel computers have dedicated communication controllers that can perform the transmission of messages without interrupting the CPUs.

8.18 Non-Blocking Communication Operations
In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations:
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
    int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
    int tag, MPI_Comm comm, MPI_Request *request)
These functions return before the corresponding operations have completed.
Function MPI_Test checks whether the non-blocking send or receive operation identified by its request has finished:
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_Wait waits for the operation to complete:
int MPI_Wait(MPI_Request *request, MPI_Status *status)
An example is here (see the sketch below).
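A minimal illustration of the non-blocking pattern (not the linked example): each process sends its rank to its right neighbor, receives from its left neighbor, and may do unrelated work before calling MPI_Wait.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int right = (rank + 1) % p, left = (rank - 1 + p) % p;
    int sendval = rank, recvval = -1;
    MPI_Request reqs[2];

    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch sendval/recvval can run here ... */

    MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
    MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);
    printf("process %d received %d from process %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}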

8.19 Cannon's Algorithm Using Non-Blocking Operations
Here is the parallel program for Cannon's algorithm using non-blocking operations (see the sketch below).
There are two main differences between this program and the earlier one using blocking operations:
1. Additional arrays, a_buffers and b_buffers, hold the blocks of A and B that are being received while the computation involving the previous blocks is performed.
2. In the main computational loop, the program first starts the non-blocking send operations that ship the locally stored blocks of A and B to the processes to the left and up the grid, and then starts the non-blocking receive operations for the blocks needed in the next iteration from the processes to the right and down the grid.
After starting these four non-blocking operations, it proceeds to perform the matrix-matrix multiplication of the blocks it currently stores.
Finally, before it proceeds to the next iteration, it uses MPI_Wait to wait for the send and receive operations to complete.
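A sketch of what such a double-buffered main loop can look like, under the same naming assumptions as the earlier blocking sketch (nb x nb blocks, neighbor ranks left/right/up/down in a periodic grid, and c zeroed and the blocks already aligned before the call); this is an illustration, not the actual linked program.

#include <mpi.h>

void cannon_nonblocking_loop(int nb, int q, double *a_buffers[2], double *b_buffers[2],
                             double *c, int left, int right, int up, int down,
                             MPI_Comm grid_comm)
{
    MPI_Request reqs[4];
    int cur = 0;                                /* index of the blocks being computed with */

    for (int step = 0; step < q; step++) {
        int nxt = 1 - cur;

        /* Start shipping the current blocks left/up and receiving the next ones
         * from the right/below. */
        MPI_Isend(a_buffers[cur], nb*nb, MPI_DOUBLE, left,  1, grid_comm, &reqs[0]);
        MPI_Isend(b_buffers[cur], nb*nb, MPI_DOUBLE, up,    1, grid_comm, &reqs[1]);
        MPI_Irecv(a_buffers[nxt], nb*nb, MPI_DOUBLE, right, 1, grid_comm, &reqs[2]);
        MPI_Irecv(b_buffers[nxt], nb*nb, MPI_DOUBLE, down,  1, grid_comm, &reqs[3]);

        /* Overlap: multiply the blocks we already hold while the transfers proceed. */
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                for (int k = 0; k < nb; k++)
                    c[i*nb + j] += a_buffers[cur][i*nb + k] * b_buffers[cur][k*nb + j];

        /* This step's sends and receives must finish before the buffers are reused. */
        for (int r = 0; r < 4; r++)
            MPI_Wait(&reqs[r], MPI_STATUS_IGNORE);

        cur = nxt;                              /* freshly received blocks become current */
    }
}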