VSCSE Summer School Programming Heterogeneous Parallel Computing Systems Lecture 6: Basic CUDA and MPI.

Objective
– To become proficient in writing a simple joint MPI-CUDA heterogeneous application
  – Understand the key sections of the application
  – Simplified code and efficient data movement using GMAC
  – One-way communication
– To become familiar with a more sophisticated MPI application that requires two-way data exchange

CUDA-based cluster
Each node contains N GPUs attached over PCIe, M CPUs, and host memory.
[Figure: two cluster nodes, each with GPU 0 … GPU N on a PCIe bus and CPU 0 … CPU M sharing host memory]

MPI Model
– Many processes distributed in a cluster
– Each process computes part of the output
– Processes communicate with each other
– Processes can synchronize
[Figure: processes running on the nodes of a cluster]

MPI Initialization, Info and Sync
– int MPI_Init(int *argc, char ***argv)
  – Initialize MPI
– MPI_COMM_WORLD
  – MPI group with all allocated nodes
– int MPI_Comm_rank(MPI_Comm comm, int *rank)
  – Rank of the calling process in the group of comm
– int MPI_Comm_size(MPI_Comm comm, int *size)
  – Number of processes in the group of comm
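As a minimal, self-contained sketch (not from the slides), these calls typically appear in the following order; the printf text is purely illustrative:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank = -1, size = -1;
        MPI_Init(&argc, &argv);               /* Initialize MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* Rank of the calling process */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* Number of processes in the group */
        printf("Process %d of %d\n", rank, size);
        MPI_Finalize();                       /* Clean up MPI state */
        return 0;
    }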

Vector Addition: Main Process

int main(int argc, char *argv[]) {
    int vector_size = 1024 * 1024 * 1024;
    int pid = -1, np = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (np < 3) {
        if (0 == pid) printf("Need 3 or more processes.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
        return 1;
    }

    if (pid < np - 1)
        compute_node(vector_size / (np - 1));   /* ranks 0 .. np-2 are compute nodes */
    else
        data_server(vector_size);               /* the last rank is the data server */

    MPI_Finalize();
    return 0;
}

MPI Sending Data
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
– Buf: Initial address of send buffer (choice)
– Count: Number of elements in send buffer (nonnegative integer)
– Datatype: Datatype of each send buffer element (handle)
– Dest: Rank of destination (integer)
– Tag: Message tag (integer)
– Comm: Communicator (handle)
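For illustration (not part of the original slides), a rank obtained from MPI_Comm_rank could send a small float array to rank 1 like this; the buffer length 256 and the tag value 0 are arbitrary choices:

    float data[256];
    /* ... fill data ... */
    if (rank == 0) {
        /* Send 256 floats to rank 1 with tag 0 over MPI_COMM_WORLD */
        MPI_Send(data, 256, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    }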

MPI Receiving Data
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
– Buf: Initial address of receive buffer (choice)
– Count: Maximum number of elements in receive buffer (integer)
– Datatype: Datatype of each receive buffer element (handle)
– Source: Rank of source (integer)
– Tag: Message tag (integer)
– Comm: Communicator (handle)
– Status: Status object (Status)
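A matching receive on rank 1 might look like the sketch below (illustrative, not from the slides); MPI_Get_count queries how many elements actually arrived, since count is only an upper bound:

    float data[256];
    MPI_Status status;
    int received = 0;
    if (rank == 1) {
        /* Receive up to 256 floats from rank 0 with tag 0 */
        MPI_Recv(data, 256, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
        /* Find out how many elements were actually delivered */
        MPI_Get_count(&status, MPI_FLOAT, &received);
    }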

Vector Addition: Server Process (I)

void data_server(unsigned int vector_size) {
    int np;
    size_t num_bytes = vector_size * sizeof(float);
    float *input_a = 0, *input_b = 0, *output = 0;

    /* Set MPI Communication Size (np must be known before deriving the node counts) */
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int num_nodes = np - 1, first_node = 0, last_node = np - 2;

    /* Allocate input data */
    input_a = (float *)malloc(num_bytes);
    input_b = (float *)malloc(num_bytes);
    output  = (float *)malloc(num_bytes);
    if (input_a == NULL || input_b == NULL || output == NULL) {
        printf("Server couldn't allocate memory\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Initialize input data */
    random_data(input_a, vector_size, 1, 10);
    random_data(input_b, vector_size, 1, 10);

Vector Addition: Server Process (II)

    /* Send a slice of each input vector to every compute node */
    float *ptr_a = input_a;
    float *ptr_b = input_b;
    for (int process = 0; process < num_nodes; process++) {
        MPI_Send(ptr_a, vector_size / num_nodes, MPI_FLOAT, process,
                 DATA_DISTRIBUTE, MPI_COMM_WORLD);
        ptr_a += vector_size / num_nodes;
        MPI_Send(ptr_b, vector_size / num_nodes, MPI_FLOAT, process,
                 DATA_DISTRIBUTE, MPI_COMM_WORLD);
        ptr_b += vector_size / num_nodes;
    }

    /* Wait for nodes to compute */
    MPI_Barrier(MPI_COMM_WORLD);

Vector Addition: Server Process (III)

    /* Collect output data */
    MPI_Status status;
    for (int process = 0; process < num_nodes; process++) {
        MPI_Recv(output + process * vector_size / num_nodes,
                 vector_size / num_nodes, MPI_FLOAT, process,
                 DATA_COLLECT, MPI_COMM_WORLD, &status);
    }

    /* Store output data */
    store_output(output, vector_size);

    /* Release resources */
    free(input_a);
    free(input_b);
    free(output);
}

Vector Addition: Compute Process (I)

void compute_node(unsigned int vector_size) {
    int np;
    size_t num_bytes = vector_size * sizeof(float);
    float *input_a, *input_b, *output;
    MPI_Status status;

    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int server_process = np - 1;

    /* Alloc host memory */
    input_a = (float *)malloc(num_bytes);
    input_b = (float *)malloc(num_bytes);
    output  = (float *)malloc(num_bytes);

    /* Get the input data from server process */
    MPI_Recv(input_a, vector_size, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);
    MPI_Recv(input_b, vector_size, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);

Vector Addition: Compute Process (II)

    /* Compute the partial vector addition */
    for (int i = 0; i < vector_size; ++i) {
        output[i] = input_a[i] + input_b[i];
    }

    /* Wait for previous communications (pairs with the server's barrier) */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Send the output */
    MPI_Send(output, vector_size, MPI_FLOAT, server_process,
             DATA_COLLECT, MPI_COMM_WORLD);

    /* Release memory */
    free(input_a);
    free(input_b);
    free(output);
}

ADDING CUDA TO MPI

Vector Addition: CUDA Process (I)

void compute_node(unsigned int vector_size) {
    int np;
    size_t num_bytes = vector_size * sizeof(float);
    float *input_a, *input_b, *output;
    MPI_Status status;

    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int server_process = np - 1;

    /* Allocate memory */
    gmacMalloc((void **)&input_a, num_bytes);
    gmacMalloc((void **)&input_b, num_bytes);
    gmacMalloc((void **)&output, num_bytes);

    /* Get the input data from server process */
    MPI_Recv(input_a, vector_size, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);
    MPI_Recv(input_b, vector_size, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);

Vector Addition: CUDA Process (II)

    /* Compute the partial vector addition on the GPU */
    dim3 Db(BLOCK_SIZE);
    dim3 Dg((vector_size + BLOCK_SIZE - 1) / BLOCK_SIZE);
    vector_add_kernel<<<Dg, Db>>>(gmacPtr(output), gmacPtr(input_a),
                                  gmacPtr(input_b), vector_size);
    gmacThreadSynchronize();

    /* Send the output */
    MPI_Send(output, vector_size, MPI_FLOAT, server_process,
             DATA_COLLECT, MPI_COMM_WORLD);

    /* Release device memory */
    gmacFree(input_a);
    gmacFree(input_b);
    gmacFree(output);
}
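The slides do not show vector_add_kernel itself; a straightforward version consistent with the launch configuration above could look like this (an assumed sketch, not the original kernel):

    __global__ void vector_add_kernel(float *output, const float *input_a,
                                      const float *input_b, unsigned int vector_size) {
        /* One thread per element; guard the last, possibly partial, block */
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < vector_size)
            output[i] = input_a[i] + input_b[i];
    }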

A Typical Wave Propagation Application
[Flow diagram] Do T = 0, Tmax:
– Insert source (e.g. acoustic wave)
– Stencil computation to compute the Laplacian
– Time integration
– Absorbing boundary conditions
Repeat until T == Tmax.

Review of Stencil Computations
[Figure] The Laplacian and time integration are computed at interior points; boundary conditions are applied at the volume faces.

Wave Propagation: Kernel Code

/* Coefficients used to calculate the Laplacian */
__constant__ float coeff[5];

__global__ void wave_propagation(float *next, float *in, float *prev,
                                 float *velocity, dim3 dim)
{
    unsigned x = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned y = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned z = threadIdx.z + blockIdx.z * blockDim.z;

    /* Point index in the input and output matrices */
    unsigned n = x + y * dim.x + z * dim.x * dim.y;

    /* Only compute for points within the matrices */
    if (x < dim.x && y < dim.y && z < dim.z) {
        /* Calculate the contribution of each point to the Laplacian */
        float laplacian = coeff[0] * in[n];

Wave Propagation: Kernel Code (continued)

        for (int i = 1; i < 5; ++i) {
            laplacian += coeff[i] * (in[n - i] +                 /* Left */
                                     in[n + i] +                 /* Right */
                                     in[n - i * dim.x] +         /* Top */
                                     in[n + i * dim.x] +         /* Bottom */
                                     in[n - i * dim.x * dim.y] + /* Behind */
                                     in[n + i * dim.x * dim.y]); /* Front */
        }

        /* Time integration */
        next[n] = velocity[n] * laplacian + 2 * in[n] - prev[n];
    }
}

Stencil Domain Decomposition
– Volumes are split into tiles (along the Z-axis) assigned to compute nodes D1, D2, D3, D4
– The 3D stencil introduces data dependencies between neighboring tiles
[Figure: x-y-z volume partitioned along z into domains D1 through D4]
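To make the tile sizes concrete, here is a small sketch (assuming dimz divides evenly by the number of compute nodes) of how many points each tile carries once the 4 ghost planes per shared face are added; the names mirror those used in the server code below:

    int planes_per_node = dimz / num_comp_nodes;
    /* First and last tiles share only one face, so they carry 4 extra planes */
    int edge_num_points = dimx * dimy * (planes_per_node + 4);
    /* Internal tiles share two faces, so they carry 8 extra planes */
    int int_num_points  = dimx * dimy * (planes_per_node + 8);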

Wave Propagation: Main Process

int main(int argc, char *argv[]) {
    int pad = 0, dimx = 480 + pad, dimy = 480, dimz = 400, nreps = 100;
    int pid = -1, np = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (np < 3) {
        if (0 == pid) printf("Need 3 or more processes.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
        return 1;
    }

    if (pid < np - 1)
        compute_node(dimx, dimy, dimz / (np - 1), nreps);
    else
        data_server(dimx, dimy, dimz, nreps);

    MPI_Finalize();
    return 0;
}

Stencil Code: Server Process (I)

void data_server(int dimx, int dimy, int dimz, int nreps) {
    int np;
    unsigned int num_points = dimx * dimy * dimz;
    unsigned int num_bytes = num_points * sizeof(float);
    float *input = 0, *output = NULL, *velocity = NULL;

    /* Set MPI Communication Size (np must be known before deriving the node counts) */
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int num_comp_nodes = np - 1, first_node = 0, last_node = np - 2;

    /* Allocate input data */
    input    = (float *)malloc(num_bytes);
    output   = (float *)malloc(num_bytes);
    velocity = (float *)malloc(num_bytes);
    if (input == NULL || output == NULL || velocity == NULL) {
        printf("Server couldn't allocate memory\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Initialize input data and velocity */
    random_data(input, dimx, dimy, dimz, 1, 10);
    random_data(velocity, dimx, dimy, dimz, 1, 10);

Stencil Code: Server Process (II)

    /* Calculate number of shared (ghost) points */
    int edge_num_points = dimx * dimy * (dimz / num_comp_nodes + 4);
    int int_num_points  = dimx * dimy * (dimz / num_comp_nodes + 8);
    float *send_address = input;

    /* Send input data to the first compute node */
    MPI_Send(send_address, edge_num_points, MPI_FLOAT, first_node,
             DATA_DISTRIBUTE, MPI_COMM_WORLD);
    send_address += dimx * dimy * (dimz / num_comp_nodes - 4);

    /* Send input data to "internal" compute nodes */
    for (int process = 1; process < last_node; process++) {
        MPI_Send(send_address, int_num_points, MPI_FLOAT, process,
                 DATA_DISTRIBUTE, MPI_COMM_WORLD);
        send_address += dimx * dimy * (dimz / num_comp_nodes);
    }

    /* Send input data to the last compute node */
    MPI_Send(send_address, edge_num_points, MPI_FLOAT, last_node,
             DATA_DISTRIBUTE, MPI_COMM_WORLD);

Stencil Code: Server Process (III)

    float *velocity_send_address = velocity;

    /* Send velocity data to compute nodes */
    for (int process = 0; process < last_node + 1; process++) {
        MPI_Send(velocity_send_address, edge_num_points, MPI_FLOAT, process,
                 DATA_DISTRIBUTE, MPI_COMM_WORLD);
        velocity_send_address += dimx * dimy * (dimz / num_comp_nodes);
    }

    /* Wait for nodes to compute */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Collect output data */
    MPI_Status status;
    for (int process = 0; process < num_comp_nodes; process++) {
        MPI_Recv(output + process * num_points / num_comp_nodes,
                 num_points / num_comp_nodes, MPI_FLOAT, process,
                 DATA_COLLECT, MPI_COMM_WORLD, &status);
    }

MPI Barriers
int MPI_Barrier(MPI_Comm comm)
– Comm: Communicator (handle)
Blocks the caller until all group members have called it; the call returns at any process only after all group members have entered the call.
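For example (an illustrative fragment, not from the slides, with hypothetical setup_phase and compute_phase helpers), a barrier can separate a setup phase from a timed phase so that MPI_Wtime measures only work done after every rank is ready:

    setup_phase();                  /* hypothetical per-rank initialization */
    MPI_Barrier(MPI_COMM_WORLD);    /* no rank proceeds until all have finished setup */
    double t0 = MPI_Wtime();
    compute_phase();                /* hypothetical per-rank work */
    double elapsed = MPI_Wtime() - t0;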

MPI Barriers
Wait until all other processes in the MPI group reach the same barrier:
1. All processes are executing Do_stuff()
2. Some processes reach the barrier and wait in the barrier until all reach it
3. All processes execute Do_more_stuff()

Example code:
    Do_stuff();
    MPI_Barrier();
    Do_more_stuff();

Stencil Code: Server Process (IV)

    /* Store output data */
    store_output(output, dimx, dimy, dimz);

    /* Release resources */
    free(input);
    free(velocity);
    free(output);
}

Stencil Code: Compute Process (I)

void compute_node_stencil(int dimx, int dimy, int dimz, int nreps) {
    int np, pid;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    unsigned int num_points = dimx * dimy * (dimz + 8);
    unsigned int num_bytes = num_points * sizeof(float);
    unsigned int num_ghost_points = 4 * dimx * dimy;
    unsigned int num_ghost_bytes = num_ghost_points * sizeof(float);

    int left_ghost_offset = 0;
    int right_ghost_offset = dimx * dimy * (4 + dimz);

    float *input = NULL, *output = NULL, *prev = NULL, *v = NULL;

    /* Allocate device memory for input and output data */
    gmacMalloc((void **)&input, num_bytes);
    gmacMalloc((void **)&output, num_bytes);
    gmacMalloc((void **)&prev, num_bytes);
    gmacMalloc((void **)&v, num_bytes);

Stencil Code: Compute Process (II)

    MPI_Status status;
    int left_neighbor  = (pid > 0)      ? (pid - 1) : MPI_PROC_NULL;
    int right_neighbor = (pid < np - 2) ? (pid + 1) : MPI_PROC_NULL;
    int server_process = np - 1;

    /* Get the input data from server process */
    float *rcv_address = input + num_ghost_points * (0 == pid);
    MPI_Recv(rcv_address, num_points, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);

    /* Get the velocity data from server process */
    rcv_address = v + num_ghost_points * (0 == pid);
    MPI_Recv(rcv_address, num_points, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);

Stencil Code: Compute Process (III)

    for (int i = 0; i < nreps; i++) {
        /* Compute the values of the volume */
        launch_kernel(output + num_ghost_points, input + volume_offset,
                      prev + volume_offset, v + volume_offset,
                      dimx, dimy, dimz);

        /* Compute the boundary conditions in faces
           (this launches one kernel per volume face) */
        launch_absorbing_boundary_conditions(output + num_ghost_points,
                                             input + num_ghost_points,
                                             prev + num_ghost_points,
                                             dimx, dimy, dimz);

        /* Wait for the computation to be done */
        gmacThreadSynchronize();

Stencil Code: Kernel Launch

void launch_kernel(float *next, float *in, float *prev, float *velocity,
                   int dimx, int dimy, int dimz) {
    dim3 Gd, Bd, Vd;

    Vd.x = dimx; Vd.y = dimy; Vd.z = dimz;

    Bd.x = BLOCK_DIM_X; Bd.y = BLOCK_DIM_Y; Bd.z = BLOCK_DIM_Z;

    Gd.x = (dimx + Bd.x - 1) / Bd.x;
    Gd.y = (dimy + Bd.y - 1) / Bd.y;
    Gd.z = (dimz + Bd.z - 1) / Bd.z;

    wave_propagation<<<Gd, Bd>>>(next, in, prev, velocity, Vd);
}

MPI Sending and Receiving Data
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
– Sendbuf: Initial address of send buffer (choice)
– Sendcount: Number of elements in send buffer (integer)
– Sendtype: Type of elements in send buffer (handle)
– Dest: Rank of destination (integer)
– Sendtag: Send tag (integer)
– Recvbuf: Initial address of receive buffer (choice)
– Recvcount: Number of elements in receive buffer (integer)
– Recvtype: Type of elements in receive buffer (handle)
– Source: Rank of source (integer)
– Recvtag: Receive tag (integer)
– Comm: Communicator (handle)
– Status: Status object (Status). This refers to the receive operation.
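As a small illustration (not from the slides, assuming rank and size come from MPI_Comm_rank and MPI_Comm_size), a ring shift in which every rank passes one integer to its right neighbor and receives one from its left can be written with a single call, avoiding the deadlock risk of a naive blocking Send followed by Recv:

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    int send_val = rank, recv_val = -1;
    MPI_Status status;
    /* Send to the right neighbor and receive from the left in one combined call */
    MPI_Sendrecv(&send_val, 1, MPI_INT, right, 0,
                 &recv_val, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, &status);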

Stencil Code: Compute Process (IV)

        /* Send our leftmost interior planes to the left neighbor,
           receive the right neighbor's planes into our right ghost region */
        MPI_Sendrecv(output + num_ghost_points, num_ghost_points, MPI_FLOAT,
                     left_neighbor, i,
                     output + right_ghost_offset, num_ghost_points, MPI_FLOAT,
                     right_neighbor, i,
                     MPI_COMM_WORLD, &status);

        /* Send our rightmost interior planes to the right neighbor,
           receive the left neighbor's planes into our left ghost region */
        MPI_Sendrecv(output + right_ghost_offset - num_ghost_points,
                     num_ghost_points, MPI_FLOAT, right_neighbor, i,
                     output + left_ghost_offset, num_ghost_points, MPI_FLOAT,
                     left_neighbor, i,
                     MPI_COMM_WORLD, &status);

        /* Rotate the buffers for the next time step */
        float *temp = output;
        output = prev;
        prev = input;
        input = temp;
    }

Stencil Code: Compute Process (V)

    /* Wait for previous communications */
    MPI_Barrier(MPI_COMM_WORLD);

    /* After the last rotation the final result sits in input; swap so output points to it */
    float *temp = output;
    output = input;
    input = temp;

    /* Send the output, skipping ghost points */
    float *send_address = output + num_ghost_points;
    MPI_Send(send_address, dimx * dimy * dimz, MPI_FLOAT, server_process,
             DATA_COLLECT, MPI_COMM_WORLD);

    /* Release resources */
    gmacFree(input);
    gmacFree(output);
    gmacFree(prev);
    gmacFree(v);
}

GETTING SOME OVERLAP

Stencil Code: Compute Process (III-bis)

    for (int i = 0; i < nreps; i++) {
        /* Compute the values of the volume */
        launch_kernel(output + num_ghost_points, input + volume_offset,
                      prev + volume_offset, v + volume_offset,
                      dimx, dimy, dimz);

        /* Wait for the computation to be done */
        gmacThreadSynchronize();

        /* Compute the boundary conditions */
        launch_boundaries(output + num_ghost_points, input + num_ghost_points,
                          prev + num_ghost_points, dimx, dimy, dimz);

Stencil Code: Compute Process (IV-bis)

        /* Send our leftmost interior planes to the left neighbor,
           receive the right neighbor's planes into our right ghost region */
        MPI_Sendrecv(output + num_ghost_points, num_ghost_points, MPI_FLOAT,
                     left_neighbor, i,
                     output + right_ghost_offset, num_ghost_points, MPI_FLOAT,
                     right_neighbor, i,
                     MPI_COMM_WORLD, &status);

        /* Send our rightmost interior planes to the right neighbor,
           receive the left neighbor's planes into our left ghost region */
        MPI_Sendrecv(output + right_ghost_offset - num_ghost_points,
                     num_ghost_points, MPI_FLOAT, right_neighbor, i,
                     output + left_ghost_offset, num_ghost_points, MPI_FLOAT,
                     left_neighbor, i,
                     MPI_COMM_WORLD, &status);

        /* Rotate the buffers for the next time step */
        float *temp = output;
        output = prev;
        prev = input;
        input = temp;
    }

QUESTIONS?