September 4, 1997 Parallel Processing (CS 730) Lecture 7: Grouping Data and Communicators in MPI Jeremy R. Johnson *Parts of this lecture was derived.

September 4, 1997 Parallel Processing (CS 730) Lecture 7: Grouping Data and Communicators in MPI Jeremy R. Johnson *Parts of this lecture was derived from chapters 6,7 in Pacheco Oct. 30, 2002 Parallel Processing

September 4, 1997 Introduction Objective: To introduce MPI commands for creating types and communicators. To discuss performance models and considerations in MPI and ways of reducing communication. Topics MPI Datatypes and packing Revised version of Get_Data Matrix transposition Creating communicators Topologies Grids Matrix Multiplication Fox’s algorithm Performance Model Oct. 30, 2002 Parallel Processing

Derived Types Due to latency of communication it is usually a good idea to package up several elements into a single message MPI_Send and MPI_Recv, allow message to be given by a start address, basic type, and a count. This allows multiple data elements to be sent in one message Requires elements to be of the same type Must be contiguous A generalized type {(t0,d0),…,(tn-1,dn-1)} ti is an existing type di is a displacement Oct. 30, 2002 Parallel Processing

Functions for Creating MPI Types
int MPI_Type_struct(int count, int block_lengths[], MPI_Aint displacements[], MPI_Datatype typelist[], MPI_Datatype new_mpi_t) MPI_Address( void* location, MPI_Aint* address) int MPI_Type_commit(MPI_Datatype* new_mpi_t) Oct. 30, 2002 Parallel Processing

Other Derived Datatype Constructors
int MPI_Type_vector(int count, int block_length, int stride, MPI_Datatype element_type, MPI_Datatype new_mpi_t) int MPI_Type_contiguous(int count, MPI_Datatype old_type, MPI_Datatype new_mpi_t) int MPI_Type_indexed(int count, int block_lengths[], int displacements[], MPI_Datatype old_type, MPI_Datatype new_mpi_t) Oct. 30, 2002 Parallel Processing

Transpose float A[10][10]; /* stored in row-major order. */
/* Send 3rd row of A from process 0 to process 1. */ If (my_rank == 0) { MPI_Send(&(A[2][0]), 10, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); } else { MPI_Recv(&(A[2][0]), 10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status); } /* Doesn’t work for columns, since not contiguous. */ Oct. 30, 2002 Parallel Processing

Transpose float A[10][10]; /* stored in row-major order. */
/* Send 3rd column of A from process 0 to process 1. */ MPI_Type_vector(10, 1, 10, MPI_FLOAT, &column_mpi_t); MPI_Type_commit(&column_mpi_t); If (my_rank == 0) { MPI_Send(&(A[0][2]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD); } else { MPI_Recv(&(A[0][2]), 1, column_mpi_t, 0, 0, MPI_COMM_WORLD, &status); } Oct. 30, 2002 Parallel Processing

Upper Triangular Matrix
float A[n][n]; /* Complete matrix */ float T[n][n]; /* Upper triangle. */ int displacements[n]; int block_lengths[n]; MPI_Datatype index_mpi_t; for(i = 0; i < n; i++) { block_lengths[i] = n-i; displacements[i] = (n+1)*i; } MPI_Type_indexed(n, block_lengths, displacments, MPI_FLOAT, &index_mpi_t); MPI_Type_commit(&index_mpi_t); if (my_rank == 0) { MPI_Send(A, 1, index_mpi_t, 1, 0, MPI_COMM_WORLD); else MPI_Recv(T, 1, index_mpi_t, 0, 0, MPI_COMM_WORLD, &status); Oct. 30, 2002 Parallel Processing

Type Matching When can a receiving process match the data send by a sending process? MPI_Send(message, send_count, send_mpi_t, 1, 0, MPI_COMM_WORLD) MPI_Recv(message, recv_count, recv_mpi_t, 0, 0,MPI_COMM_WORLD,&status) Given a derived type {(t0,d0),…,(tn-1,dn-1)} Displacements do not matter Type signatures {t0,…,tn-1} and {u0,…,um-1} must be compatible n  m, ti = ui for i=0,…,n-1 For collective communications sending and receiving types must be identical Oct. 30, 2002 Parallel Processing

Type Matching Example For type column_mpi_t (column of 10  10 array of floats) {(MPI_FLOAT,0), (MPI_FLOAT,10*sizeof(float)), (MPI_FLOAT,20*sizeof(float)),…,(MPI_FLOAT,90*sizeof(float))} Signature is {MPI_FLOAT,…,MPI_FLOAT}, MPI_FLOAT 10 times Example: Can send column to row float A[10][10]; /* stored in row-major order. */ If (my_rank == 0) MPI_Send(&(A[0][0]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD); else if (my_rank == 0) MPI_Recv(&(A[0][0]),10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status); Oct. 30, 2002 Parallel Processing

Pack and Unpack MPI_Pack and MPI_Unpack allow a user to copy non-contiguous memory locations into a contiguous buffer and to copy a contiguous buffer into non-contiguous memory locations int MPI_Pack(void* pack_data, int in_count, MPI_Datatype datatype, void* buffer, int buffer_size, int* position, MPI_Comm comm) On input copy data starting at location &buffer + position On output position points to first location in the buffer after pack_data int MPI_Unpack(void* buffer, int size, int* position,void* unpack_data, int count,MPI_Datatype datatype, MPI_Comm comm) Data starting at location &buffer + *position is copied into memory referenced by unpack_data count data elements of type datatype are copied into unpack_data position is updated to point to location in buffer after the data just copied Messages constructed with MPI_Pack should be communicated with datatype argument MPI_PACKED Oct. 30, 2002 Parallel Processing

Deciding which Method to Use
Creating a derived type has overhead associated with it. Depends on number of times type will be used. Can avoid system buffering with Pack/Unpack Can use variable length messages with Pack/Unpack Oct. 30, 2002 Parallel Processing

Variable Length Messages
float* entries; int* column_subscripts; int nonzeros; int position; int row_number; char buffer[HUGE]; if (my_rank == 0) { position = 0; MPI_Pack(&nonzeros,1,MPI_INT,buffer,HUGE,&position,MPI_COMM_WORLD); MPI_Pack(row_number,1,MPI_INT,buffer,HUGE,&position,MPI_COMM_WORLD); MPI_Pack(entries,nonzeros,MPI_FLOAT,buffer,HUGE, &position,MPI_COMM_WORLD); MPI_Pack(column_subscripts,nonzeros,MPI_FLOAT,buffer,HUGE, MPI_Send(buffer,position,MPI_PACKED, 1,0,MPI_COMM_WORLD); } Oct. 30, 2002 Parallel Processing

Variable Length Messages (cont)
else { MPI_Recv(buffer,HUGE,MPI_PACKED, 0,0, MPI_COMM_WORLD, &status); position = 0; MPI_UnPack(buffer,HUGE,&position, &nonzeros, MPI_INT,MPI_COMM_WORLD); MPI_UnPack(buffer,HUGE,&position, &row_number, MPI_INT, MPI_COMM_WORLD); entries = (float *) malloc(nonzeros*sizeof(float)); column_subscripts = (int *) malloc(nonzeros*sizeof(int)); MPI_UnPack(buffer,HUGE,&position, entries, MPI_FLOAT, MPI_UnPack(buffer,HUGE,&position, &column_subscripts, MPI_INT, } Oct. 30, 2002 Parallel Processing

Communicators A mechanism to treat a subset of processes as a universe for communication (both point-to-point and collective) Types intro-communicator inter-communicator Components group (ordered collection of processes) context (unique identifier) optional additional information such as topology Create new communicators from existing communicators Oct. 30, 2002 Parallel Processing

Working with Groups, Contexts, and Communicators
MPI_Comm_group(MPI_Comm comm, MPI_Group* group) MPI_Group_incl(MPI_Group old_group, int new_group_size, int ranks_in_old_group[]; MPI_Group* new_group) MPI_Comm_create(MPI_Comm old_comm, MPI_Group group, MPI_Comm* new_com) Oct. 30, 2002 Parallel Processing

Creating a Communicator
/* Create communicator out of first row of q^2 processes organized in a q  q grid in row-major order. */ MPI_Group group_world; MPI_Group first_row_group; MPI_Comm first_row_comm; int* process_ranks; process_ranks = (int*) malloc(q*sizeof(int)); for (proc = 0; proc < q; proc++) process_ranks[proc] = proc; MPI_Comm_Group(MPI_COMM_WORLD,&group_world); MPI_Group_incl(group_world,q,process_ranks,&first_row_group); MPI_Comm_create(MPI_COMM_WORLD,first_row_group,&first_row_comm); Oct. 30, 2002 Parallel Processing

Using a Communicator /* Broadcast first block to all processes in the same row. */ int my_rank_in_first_row; float* A00; if (my_rank < q) { MPI_Comm_rank(first_row_comm,&my_rank_in_first_row); A_00 = (float *) malloc(n_bar*n_bar*sizeof(float)); if (my_rank_in_first_row == 0) { /* initialize A_00 */} MPI_Bcast(A_00,n_bar*n_bar,MPI_FLOAT,0,first_row_comm); } Oct. 30, 2002 Parallel Processing

MPI_Comm_split int MPI_Comm_split(MPI_Comm old_comm,
int split_key, int rank_key, MPI_Comm new_comm) MPI_Comm my_row_comm; int my_row; /* my_rank is in MPI_COMM_WORLD, q*q = p */ my_row = my_rank/q; MPI_Comm_split(MPI_COMM_WORLD,my_row,my_rank,&my_row_comm); /* Creates q new communicators. Processes with the same value of split_key form a new group. The rank in the new communicator is determined by rank_key. Order is preserved. If the same rank_key is used, then the choice is arbitrary. */ Oct. 30, 2002 Parallel Processing

Topologies Communicators can have attributes. One such attribute is a topology. A topology is a mechanism for associating a different addressing scheme with processes belonging to a group. Provides a virtual interconnection organization of processes that may be convenient for a particular algorithm. Types Cartesian (grid) Graph Oct. 30, 2002 Parallel Processing

Working with Cartesian Topologies
MPI_Cart_create(MPI_Comm old_comm, int number_of_dims, int dim_sizes[], int wrap_around[], int reorder, MPI_Comm* cart_comm) MPI_Cart_rank(MPI_Comm comm, int rank, int number_of_dims, int* rank) MPI_Cart_coords(MPI_Comm comm, int rank, int number_of_dims, int coordinates[]) MPI_Cart_sub(MPI_Comm cart_comm, int free_coords[], MPI_Comm* new_comm) Oct. 30, 2002 Parallel Processing

Creating a Cartesian Topology
/* create communicator with 2D grid topology. */ MPI_Comm grid_comm; int dim_sizes[2]; int wrap_around[2]; int reorder = 1; dim_sizes[0] = dim_sizes[1] = q; wrap_around[0] = wrap_around[1] = 1; MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes,wrap_around,reorder, &grid_comm); Oct. 30, 2002 Parallel Processing

Creating a Sub-Cartesian Topology
int free_coords[2]; MPI_Comm row_comm; /* create communicator for each row of grid_comm */ free_coords[0] = 0; free_coords[1] = 1; MPI_Cart_sub(grid_comm, free_coords, &row_comm) /* create communicator for each column of grid_comm */ free_coords[0] = 1; free_coords[1] = 0; MPI_Cart_sub(grid_comm, free_coords, &col_comm) Oct. 30, 2002 Parallel Processing

Cartesian Addressing int coordinates[2]; int my_grid_rank;
MPI_Comm_rank(grid_comm, &my_grid_rank); MPI_Cart_coords(grid_comm, my_grid_rank,2, coordinates); /* inverse operation */ MPI_Cart_rank(grid_comm, coordinates, &my_grid_rank) Oct. 30, 2002 Parallel Processing

Matrix Multiplication
Let A, B be n  n matrices, and C = A*B void Serial_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) { int i,j,k; for (i=0; i< n; i++) for (j=0; j< n;j++) { C[i][j] = 0.0; for (k=0; k < n;k++) C[i][j] = C[i][j] + A[i][k]*B[k][j]; } Oct. 30, 2002 Parallel Processing

Parallel Matrix Multiplication
/* distribute matrices by rows. */ void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) { for each column of B { Allgather(column); Compute dot product of my row of A with column; } /* can distribute matrices by blocks of rows. Also B could be distributed by * columns */ Oct. 30, 2002 Parallel Processing

Cyclic Matrix Multiplication
/* Arrange processors in a circle, storing rows of A and columns of B in each process. Ci* = Ai* * B*0 + … + Ai* * B*n-1 */ void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) { i = rank; Blocal = ith column of B; Alocal = ith row of A; Clocal = 0; dest = (i+1) % n; src = (i-1) % n; for (k=0;k<n;k++) { Clocal = Clocal + Alocal * Blocal; send_recv(Blocal,dest,src); } Oct. 30, 2002 Parallel Processing

Fox’s Matrix Multiplication
Let A, B be q  q matrices, and C = A*B Organize processors into a sqrt(p)  sqrt(p) grid Store (i,j) block on processor (i,j) Broadcast elements of A as k = 0,…,n-1 Cyclically rotate elements of B. Oct. 30, 2002 Parallel Processing

Example A00 B00 A00 B01 A00 B02 A11 B10 A11 B11 A11 B12 A22 B20
Oct. 30, 2002 Parallel Processing

Fox’s Matrix Multiplication
/* Uses a block matrix allocation. Group processors in a q × q grid, where q = sqrt(p). Processor (i,j) stores Aij and initially Bij */ void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) { i = my process row; j = my process column; dest = ((i-1) % q,j); src = ((i+1) % q,j); for (stage=0;stage < q; stage++) { k_bar = (i + stage) mod q; Broadcast A[i,k_bar] across process row i; C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j]; Send B[k_bar,j] to dest; Receive B[(k_bar+1) mod q,j] from source; } Oct. 30, 2002 Parallel Processing

September 4, 1997 Parallel Processing (CS 730) Lecture 7: Grouping Data and Communicators in MPI Jeremy R. Johnson *Parts of this lecture was derived.

Similar presentations

Presentation on theme: "September 4, 1997 Parallel Processing (CS 730) Lecture 7: Grouping Data and Communicators in MPI Jeremy R. Johnson *Parts of this lecture was derived."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

September 4, 1997 Parallel Processing (CS 730) Lecture 7: Grouping Data and Communicators in MPI Jeremy R. Johnson *Parts of this lecture was derived.

Similar presentations

Presentation on theme: "September 4, 1997 Parallel Processing (CS 730) Lecture 7: Grouping Data and Communicators in MPI Jeremy R. Johnson *Parts of this lecture was derived."— Presentation transcript:

Similar presentations

About project

Feedback