1 Friday, October 20, 2006 “Work expands to fill the time available for its completion.” -Parkinson’s 1st Law

2
MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count_recvd)
- MPI_Get_count returns the number of entries received in the count_recvd variable.
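A minimal sketch of how the two calls fit together (the buffer capacity, tag, and message length below are illustrative choices, not from the slides): the count passed to MPI_Recv is only the capacity of the receive buffer, while MPI_Get_count reports how many entries actually arrived.

#include <mpi.h>
#include <stdio.h>

#define CAPACITY 1024        /* receive-buffer capacity, not the message length */

int main(int argc, char *argv[]) {
    int rank, buf[CAPACITY], count_recvd;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int msg[10] = {0};
        MPI_Send(msg, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* send 10 ints */
    } else if (rank == 1) {
        MPI_Recv(buf, CAPACITY, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count_recvd);
        printf("received %d entries\n", count_recvd);        /* prints 10 */
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g. mpirun -np 2).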

3 Matrix-Vector Multiplication
- n x n matrix A
- Vector b
- x = Ab
- p processing elements
- Suppose A is distributed row-wise (n/p rows per process)
- Each process computes a different portion of x
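One way such a row-wise distribution could be set up is with MPI_Scatter from a root process that holds the full matrix; the sketch below assumes n is divisible by p, and the function and variable names are ours, not from the slides.

#include <mpi.h>
#include <stdlib.h>

/* Scatter an n x n row-major matrix A and a length-n vector b, held on
   rank 0, so that every process ends up with n/p rows of A and the
   matching n/p entries of b.  A and b need only be valid on rank 0. */
void distribute_rowwise(double *A, double *b, int n,
                        double **A_local, double **b_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int rows = n / p;                      /* rows owned by each process */

    *A_local = malloc((size_t)rows * n * sizeof(double));
    *b_local = malloc((size_t)rows * sizeof(double));

    MPI_Scatter(A, rows * n, MPI_DOUBLE, *A_local, rows * n, MPI_DOUBLE,
                0, comm);
    MPI_Scatter(b, rows, MPI_DOUBLE, *b_local, rows, MPI_DOUBLE,
                0, comm);
}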

4 Matrix-Vector Multiplication (initial distribution; colors represent data distributed on different processes). [Figure: A, b, and x, with n/p rows of A per process.]

5 Matrix-Vector Multiplication (colors represent that all parts of b are required by each process). [Figure: A, b, and x, with n/p rows of A per process.]

6 Matrix-Vector Multiplication (all parts of b are required by each process)
- Which collective operation can we use?

7 Matrix-Vector Multiplication (all parts of b are required by each process)

8 Collective communication

9 Matrix-Vector Multiplication
- n x n matrix A
- Vector b
- x = Ab
- p processing elements
- Suppose A is distributed column-wise (n/p columns per process)
- Each process computes a different portion of x.

10 Matrix-Vector Multiplication (initial distribution; colors represent data distributed on different processes). [Figure: A, b, and x, with n/p columns of A per process.]

11 Partial sums calculated by each process. [Figure: with n/p columns of A per process, each process computes a full-length partial vector (partial x0, partial x1, partial x2, partial x3); summing these element-wise gives x = (x0, x1, x2, x3).]

12

13 MPI_Reduce
[Figure: Task 0, Task 1, Task 2, and Task 3 each contribute an array; the result arrives at Task 1.]
- Element-wise reduction can be done.
- count = 4, dest = 1
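A sketch matching the slide's parameters (count = 4, destination rank 1), assuming element-wise summation as the reduction operation:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local[4]  = {rank, rank, rank, rank};   /* each task's contribution  */
    int summed[4];                              /* meaningful only on root 1 */

    /* count = 4, root (dest) = 1, element-wise sum across all tasks */
    MPI_Reduce(local, summed, 4, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);

    if (rank == 1)
        printf("summed = %d %d %d %d\n",
               summed[0], summed[1], summed[2], summed[3]);

    MPI_Finalize();
    return 0;
}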

14

15

16
- Row-wise distribution requires one MPI_Allgather operation.
- Column-wise distribution requires MPI_Reduce and MPI_Scatter operations.
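A sketch of the row-wise case built around the single MPI_Allgather mentioned above (the function and variable names are ours; n is assumed divisible by p, with the block-row distribution already in place):

#include <mpi.h>
#include <stdlib.h>

/* Row-wise mat-vec: each process owns n/p rows of A (row-major) and the
   matching n/p entries of b, and produces its n/p entries of x = Ab. */
void matvec_rowwise(double *A_local, double *b_local, double *x_local,
                    int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int rows = n / p;

    /* every process needs all of b: one MPI_Allgather */
    double *b = malloc(n * sizeof(double));
    MPI_Allgather(b_local, rows, MPI_DOUBLE, b, rows, MPI_DOUBLE, comm);

    /* local dense mat-vec on the owned rows */
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * b[j];
        x_local[i] = sum;
    }
    free(b);
}

In the column-wise case each process would instead compute a full-length partial vector, combine the partials with MPI_Reduce, and hand each process its n/p entries of the result with MPI_Scatter.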

17 Matrix-Matrix Multiplication
- A and B are n x n matrices.
- p is the number of processing elements.
- The matrices are partitioned into blocks of size n/√p x n/√p.

18 [Figure: A, B, and C.] 16 processes, each represented by a different color. Different portions of the n x n matrices are divided among these processes.

19 [Figure: A, B, and C.] 16 processes, each represented by a different color. Different portions of the n x n matrices are divided among these processes. BUT! To compute C_{i,j} we need all sub-matrices A_{i,k} and B_{k,j} for 0 <= k < √p.

20 To compute C_{i,j} we need all sub-matrices A_{i,k} and B_{k,j} for 0 <= k < √p.
- All-to-all broadcast of matrix A's blocks in each row.
- All-to-all broadcast of matrix B's blocks in each column.
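In block form, the product being computed is

    C_{i,j} = sum over k = 0, 1, ..., √p - 1 of A_{i,k} B_{k,j}

so the process that owns C_{i,j} must see every block in row i of A and every block in column j of B.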

21 Cannon's Algorithm
- Memory-efficient version of the previous algorithm.
- Each process in the ith row requires all √p sub-matrices A_{i,k}, 0 <= k < √p.
- Schedule the computation so that the √p processes in the ith row use a different A_{i,k} at any given time.

22 [Figure: A and B.] 16 processes, each represented by a different color. Different portions of the n x n matrices are divided among these processes.

23 [Figure: A, B, and C, each partitioned into a 4 x 4 grid of blocks A_{00}..A_{33} and B_{00}..B_{33}.]

24 Cannon's Algorithm. To compute C_{0,0} we need all sub-matrices A_{0,k} and B_{k,0} for 0 <= k < √p. [Figure: the block row A_{00}, A_{01}, A_{02}, A_{03} and the block column B_{00}, B_{10}, B_{20}, B_{30}.]

25 Cannon's Algorithm: shift left, shift up. [Figure: the block row now reads A_{01}, A_{02}, A_{03}, A_{00} and the block column reads B_{10}, B_{20}, B_{30}, B_{00}.]

26 Cannon's Algorithm: shift left, shift up. [Figure: the block row now reads A_{02}, A_{03}, A_{00}, A_{01} and the block column reads B_{20}, B_{30}, B_{00}, B_{10}.]

27 Cannon's Algorithm: shift left, shift up. The sequence of √p sub-matrix multiplications is done, completing C_{0,0}. [Figure: the block row now reads A_{03}, A_{00}, A_{01}, A_{02} and the block column reads B_{30}, B_{00}, B_{10}, B_{20}.]

28 [Figure: A, B, and C, each partitioned into a 4 x 4 grid of blocks A_{00}..A_{33} and B_{00}..B_{33}.]

29 A_{01} and B_{01} should not be multiplied! [Figure: the 4 x 4 block partitioning of A, B, and C.]

30 Some initial alignment is required! [Figure: the 4 x 4 block partitioning of A, B, and C.]

31 Initial alignment:
- Shift all sub-matrices A_{i,j} to the left (with wraparound) by i steps.
- Shift all sub-matrices B_{i,j} up (with wraparound) by j steps.
After these circular shift operations, process P_{ij} holds sub-matrices A_{i,(j+i) mod √p} and B_{(i+j) mod √p, j}. [Figure: the 4 x 4 block partitioning of A, B, and C.]

32 After initial alignment:

A:  A_{00} A_{01} A_{02} A_{03}        B:  B_{00} B_{11} B_{22} B_{33}
    A_{11} A_{12} A_{13} A_{10}            B_{10} B_{21} B_{32} B_{03}
    A_{22} A_{23} A_{20} A_{21}            B_{20} B_{31} B_{02} B_{13}
    A_{33} A_{30} A_{31} A_{32}            B_{30} B_{01} B_{12} B_{23}

33 Topologies
- Many computational science and engineering problems use a series of matrix or grid operations.
- The dimensions of the matrices or grids are often determined by the physical problem.
- Frequently in multiprocessing, these matrices or grids are partitioned, or domain-decomposed, so that each partition is assigned to a process.

34 Topologies
- MPI uses a linear ordering and views processes as a 1-D topology.
- Although it is still possible to refer to each partition by its linear rank number, mapping the linear process ranks to a higher-dimensional virtual numbering gives a much clearer and more natural computational representation.

35 Topologies
- To address this need, the MPI library provides topology routines.
- Interacting processes are then identified by their coordinates in that topology.

36 Topologies
- Each MPI process is mapped into the higher-dimensional topology.
[Figure: different ways to map a set of processes to a two-dimensional grid. (a) and (b) show a row-wise and a column-wise mapping of the processes, (c) shows a mapping that follows a space-filling curve (dotted line), and (d) shows a mapping in which neighboring processes are directly connected in a hypercube.]

37 Topologies
- Ideally, the mapping would be determined by the interaction pattern among processes and the connectivity of the physical processors.
- However, MPI's mechanism for assigning ranks does not use any information about the interconnection network.
- Reason: this preserves MPI's architecture-independent advantages (otherwise different mappings would have to be specified for different interconnection networks).
- It is left to the MPI library to find an appropriate mapping that reduces the cost of sending and receiving messages.

38
- MPI allows specification of virtual process topologies in terms of a graph.
- Each node in the graph corresponds to a process, and an edge exists between two nodes if they communicate with each other.
- The most common topologies are Cartesian topologies (one-, two-, or higher-dimensional grids).

39 Creating and Using Cartesian Topologies
We can create Cartesian topologies using the function:
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
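A possible call sequence (the 2-D periodic grid and the use of MPI_Dims_create to pick the grid shape are illustrative choices, not prescribed by the slides):

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Comm comm_cart;
    int p, dims[2] = {0, 0};     /* zeros let MPI_Dims_create choose the shape */
    int periods[2] = {1, 1};     /* wraparound in both dimensions */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    MPI_Dims_create(p, 2, dims);                /* e.g. 16 processes -> 4 x 4 */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* reorder */, &comm_cart);

    /* ... work with the grid communicator comm_cart ... */

    MPI_Comm_free(&comm_cart);
    MPI_Finalize();
    return 0;
}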

40

41

42 With processes renamed in a 2-D grid topology, we can assign or distribute work, or distinguish among processes, by their grid coordinates rather than by their linear process ranks.

43
- MPI_CART_CREATE is a collective communication function.
- It must be called by all processes in the group.

44 Creating and Using Cartesian Topologies
- Since sending and receiving messages still require (one-dimensional) ranks, MPI provides routines to convert ranks to Cartesian coordinates and vice versa.
int MPI_Cart_coords(MPI_Comm comm_cart, int rank, int maxdims, int *coords)
int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)
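A short sketch of the round trip between ranks and coordinates, assuming comm_cart is the 2-D Cartesian communicator created as above:

/* assumes comm_cart was created with ndims = 2 */
int my_rank, coords[2], same_rank;

MPI_Comm_rank(comm_cart, &my_rank);
MPI_Cart_coords(comm_cart, my_rank, 2, coords);   /* rank -> (row, col) */
MPI_Cart_rank(comm_cart, coords, &same_rank);     /* (row, col) -> rank */
/* same_rank is again equal to my_rank */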

45 Creating and Using Cartesian Topologies
- The most common operation on Cartesian topologies is shifting data along a dimension of the topology.
int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step, int *rank_source, int *rank_dest)
- MPI_CART_SHIFT is used to find two "nearby" neighbors of the calling process along a specific direction of an N-dimensional Cartesian topology.
- The direction is specified by the input argument dir to MPI_CART_SHIFT.
- The two neighbors are called the "source" and "destination" ranks.
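As an illustration of how this supports the shifts in Cannon's algorithm, the fragment below (continuing from the comm_cart sketch above; a_block and block_elems, which describe the local block of A, are assumed names) rotates each process's A block one step to the left with wraparound:

int rank_source, rank_dest;

/* dimension 1 (columns), displacement -1: destination is the left neighbor,
   source is the right neighbor (wraparound since periods[1] = 1) */
MPI_Cart_shift(comm_cart, 1, -1, &rank_source, &rank_dest);

MPI_Sendrecv_replace(a_block, block_elems, MPI_DOUBLE,
                     rank_dest, 0,        /* send to the left neighbor       */
                     rank_source, 0,      /* receive from the right neighbor */
                     comm_cart, MPI_STATUS_IGNORE);

The same pattern with direction 0 and the local B block implements the upward shifts.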

46

47 …

48

49

50 …

51 Matrix-Vector Multiplication (block distribution; colors represent data distributed on different processes). [Figure: A, b, and x.]

52 Matrix-Vector Multiplication (colors represent the parts of b required by each process). [Figure: A, b, and x.]