Parallel Processing - MPI


MPI Tutorial

Example 1: A 1024x1024 gray-scale bitmap image is stored in the file "INPUT.BMP". Write a program that displays the number of pixels that have a value greater than the average value of all pixels.
(a) Write the program for a message passing system, using MPI without any of the collective communication functions.
(b) Write the program for a message passing system, using MPI with the most appropriate collective communication functions.

Example 1a: Methodology (flowchart summary)
Initialise MPI, then branch on the process id (Id = 0?):
Process 0: read the file; send a data segment to each of the other processes; calculate my local sum; receive the local sums from the others and find the global sum and average; send the average to the others; count my pixels greater than the average; receive the local counts from the others; find and display the global count.
Other processes: receive my data segment from the root; calculate my local sum; send my local sum to the root; receive the average from the root; count my pixels greater than the average; send my count to the root.
Exit.

Example 1a:
void main(int argc, char *argv[])
{ int myid, nprocs, i, k, myrows, mysize, aver, num=0, mysum=0, temp;
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  myrows = 1024/nprocs;
  mysize = 1024*myrows;
  int lvals[myrows][1024];
  if (myid==0)      /************ Code executed only by the process with id = 0 ***********/
  { int dvals[1024][1024];
    get_data(filename, dvals);                     /* Copy data from the data file to array dvals[][] */
    for (i=1; i<nprocs; i++)                       /* Send one segment of rows to each other process */
      MPI_Send(dvals[i*myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD);
    for (i=0; i<myrows; i++)
      for (k=0; k<1024; k++)
        mysum += dvals[i][k];                      /* Calculate partial sum */
    for (i=1; i<nprocs; i++)                       /* Collect the partial sums */
    { MPI_Recv(&temp, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
      mysum += temp; }                             /* Calculate global sum */
    aver = mysum/(1024*1024);
    for (i=1; i<nprocs; i++)                       /* Send the average to the other processes */
      MPI_Send(&aver, 1, MPI_INT, i, 2, MPI_COMM_WORLD);
    for (i=0; i<myrows; i++)
      for (k=0; k<1024; k++)
        if (dvals[i][k]>aver) num++;               /* Count pixels greater than the average */
    for (i=1; i<nprocs; i++)                       /* Collect the partial counts */
    { MPI_Recv(&temp, 1, MPI_INT, i, 3, MPI_COMM_WORLD, &stat);
      num += temp; }                               /* Calculate global number of pixels greater than the average */
    printf("Result = %d", num);
  }               /***** End of code executed only by process 0 *****/

Example 1a: (continued)
  if (myid!=0)      /***** Code executed by all processes except process 0 *****/
  { MPI_Recv(lvals, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
    for (i=0; i<myrows; i++)
      for (k=0; k<1024; k++)
        mysum += lvals[i][k];                      /* Calculate partial sum */
    MPI_Send(&mysum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Recv(&aver, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &stat);
    for (i=0; i<myrows; i++)
      for (k=0; k<1024; k++)
        if (lvals[i][k]>aver) num++;               /* Count pixels greater than the average */
    MPI_Send(&num, 1, MPI_INT, 0, 3, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

Example 1b: Methodology (flowchart summary)
Initialise MPI.
If Id = 0: read the file.
Scatter the data segments (root = process 0).
Calculate my local sum.
Reduce the local sums to the global sum (root = process 0).
If Id = 0: find the average.
Broadcast the average (root = process 0).
Count my local pixels greater than the average.
Reduce the local counts to the global count (root = process 0).
If Id = 0: display the global count.
Exit.

Example 1b:
void main(int argc, char *argv[])
{ int myid, nprocs, i, k, myrows, aver, res, num=0;
  int dvals[1024][1024];
  int mysum=0;
  int sum=0;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid==0)
    get_data(filename, dvals);                     /* Copy data from the data file to array dvals[][] */
  myrows = 1024/nprocs;
  int lvals[myrows][1024];
  MPI_Scatter(dvals, myrows*1024, MPI_INT, lvals, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD);
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
      mysum += lvals[i][k];                        /* Calculate partial sum */
  MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid == 0) aver = sum/(1024*1024);
  MPI_Bcast(&aver, 1, MPI_INT, 0, MPI_COMM_WORLD);
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
      if (lvals[i][k]>aver) num++;                 /* Count pixels greater than the average */
  MPI_Reduce(&num, &res, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid == 0) printf("The result is %d.\n", res);
  MPI_Finalize();
}

Example 2: An image enhancement technique for gray-scale images requires that all pixels with a value less than the average value of all pixels of the image are set to zero, while the rest remain the same. You are required to write a program that implements this technique. Use as input the file 'input.dat', which contains a 1024x1024 gray-scale bitmap image. Store the new image in the file 'output.dat'. Write the program for a message passing system, using MPI with the most appropriate collective communication functions.

Example 2:
void main(int argc, char *argv[])
{ int myid, nprocs, i, k, myrows, aver;
  int dvals[1024][1024];
  int mysum=0;
  int sum=0;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid==0)
    get_data(filename, dvals);                     /* Copy data from the data file to array dvals[][] */
  myrows = 1024/nprocs;
  int lvals[myrows][1024];
  MPI_Scatter(dvals, myrows*1024, MPI_INT, lvals, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD);
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
      mysum += lvals[i][k];                        /* Calculate partial sum */
  MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid == 0) aver = sum/(1024*1024);
  MPI_Bcast(&aver, 1, MPI_INT, 0, MPI_COMM_WORLD);
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
      if (lvals[i][k]<aver) lvals[i][k] = 0;       /* If the value is less than the average, set it to 0 */
  MPI_Gather(lvals, myrows*1024, MPI_INT, dvals, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD);
  if (myid == 0) write_data(filename, dvals);
  MPI_Finalize();
}

Example 3: The following MPI program uses only the MPI_Send and MPI_Recv functions to transfer data between processes. Rewrite the program for a message passing system, using MPI with the most appropriate collective communication functions.
void main(int argc, char *argv[])
{ int myid, nprocs, i, k, myrows, mysize, aver, mkey=0, num=0, mysum=0, temp;
  int dvals1[1024][1024], dvals2[1024][1024];      /* input matrices, filled only by process 0 */
  int lvals2[1024][1024];
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  myrows = 1024/nprocs;
  mysize = 1024*myrows;
  int lvals1[myrows][1024];
  if (myid==0)
  { get_data(filename1, dvals1);
    get_data(filename2, dvals2);
  }
  /* ...... see next slide ...... */
  MPI_Finalize();
}

Example 3: (continued)
  if (myid==0)
  { for (i=1; i<nprocs; i++)
    { MPI_Send(dvals1[i*myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD);
      MPI_Send(dvals2, 1024*1024, MPI_INT, i, 1, MPI_COMM_WORLD);
    }
    for (i=0; i<myrows; i++)
      for (k=0; k<1024; k++)
      { dvals1[i][k] = dvals1[i][k] * dvals2[k][i];
        mkey += dvals1[i][k];
      }
    for (i=1; i<nprocs; i++)
    { MPI_Recv(dvals1[i*myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD, &stat);
      MPI_Recv(&temp, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
      mkey += temp;
    }
    aver = mkey/(1024*1024);
    printf("Result = %d", mkey);
  }
  if (myid!=0)
  { MPI_Recv(lvals1, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
    MPI_Recv(lvals2, 1024*1024, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
    for (i=0; i<myrows; i++)
      for (k=0; k<1024; k++)
      { lvals1[i][k] = lvals1[i][k] * lvals2[k][i];
        mkey += lvals1[i][k];
      }
    MPI_Send(lvals1, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD);
    MPI_Send(&mkey, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
  }

Example 3: (Solution 1/2) The point-to-point communication in the code above maps onto collective operations as follows:
- The loop in which process 0 sends a different segment of dvals1 to each process, matched by the receives into lvals1, corresponds to a Scatter.
- The loop in which process 0 sends the whole dvals2 array to every process, matched by the receives into lvals2, corresponds to a Broadcast.
- The loop in which process 0 receives each process's block of results back into dvals1, matched by the sends of lvals1, corresponds to a Gather.
- The loop in which process 0 receives each process's mkey and accumulates the global sum, matched by the sends of mkey, corresponds to a Reduce (with MPI_SUM).

Example 3: (Solution 2)
void main(int argc, char *argv[])
{ int myid, nprocs, i, k, myrows, mysize, aver, mkey=0, temp;
  int dvals1[1024][1024];                          /* global matrix, used only by process 0 */
  int lvals2[1024][1024];
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  myrows = 1024/nprocs;
  mysize = 1024*myrows;
  int lvals1[myrows][1024];
  if (myid==0)
  { get_data(filename1, dvals1);
    get_data(filename2, lvals2);
  }
  MPI_Scatter(dvals1, myrows*1024, MPI_INT, lvals1, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Bcast(lvals2, 1024*1024, MPI_INT, 0, MPI_COMM_WORLD);
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
    { lvals1[i][k] = lvals1[i][k] * lvals2[k][i];
      mkey += lvals1[i][k];
    }
  MPI_Gather(lvals1, myrows*1024, MPI_INT, dvals1, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Reduce(&mkey, &temp, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid==0) printf("Result = %d", temp);
  MPI_Finalize();
}

Example 4: Rewrite the program below for a message passing system, without using any of the collective communication functions.
void main(int argc, char *argv[])
{ int myid, nprocs, i, k, myrows, mysize, aver, num=0, mysum=0, temp;
  int arr1[1024][1024]; int arr2[1024][1024];
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  myrows = 1024/nprocs;
  mysize = 1024*myrows;
  int arr3[myrows][1024];
  int arr4[myrows][1024];
  if (myid==0) { get_data(filename1, arr1); get_data(filename2, arr2); }
  MPI_Scatter(arr1, myrows*1024, MPI_INT, arr3, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Bcast(arr2, 1024*1024, MPI_INT, 0, MPI_COMM_WORLD);
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
    { mysum += arr3[i][k];                         /* arr3 still holds the scattered rows of arr1 here */
      arr3[i][k] = arr3[i][k] + arr2[k][i];
    }
  MPI_Allreduce(&mysum, &temp, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  aver = temp/1024;
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
      if (arr3[i][k] < aver) arr4[i][k] = 0; else arr4[i][k] = arr3[i][k];
  MPI_Gather(arr4, myrows*1024, MPI_INT, arr1, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD);
  ...
}

Answer 4: (part 1)
void main(int argc, char *argv[])
{ int myid, nprocs, i, k, myrows, mysize, aver, num=0, mysum=0, temp;
  int arr1[1024][1024]; int arr2[1024][1024];
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  myrows = 1024/nprocs;
  mysize = 1024*myrows;
  int arr3[myrows][1024];
  int arr4[myrows][1024];
  if (myid==0) { get_data(filename1, arr1); get_data(filename2, arr2); }
  // MPI_Scatter(arr1, myrows*1024, MPI_INT, arr3, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD); is replaced by:
  if (myid == 0)
  { for (i=1; i<nprocs; i++)                       /* send one segment of rows to each other process */
      MPI_Send(arr1[i*myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD);
    for (i=0; i<myrows; i++)                       /* process 0 keeps its own segment */
      for (k=0; k<1024; k++)
        arr3[i][k] = arr1[i][k];
  }
  else
    MPI_Recv(arr3, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
  // MPI_Bcast(arr2, 1024*1024, MPI_INT, 0, MPI_COMM_WORLD); is replaced by:
  if (myid == 0)
  { for (i=1; i<nprocs; i++)                       /* no need for process 0 to send to itself */
      MPI_Send(arr2, 1024*1024, MPI_INT, i, 1, MPI_COMM_WORLD);
  }
  else
    MPI_Recv(arr2, 1024*1024, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
    { mysum += arr3[i][k];
      arr3[i][k] = arr3[i][k] + arr2[k][i];
    }
  /* ...... continued on the next slide ...... */

Answer 4: (part 2)
  /* ...... continued from the previous slide ...... */
  // MPI_Allreduce(&mysum, &temp, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); is replaced by:
  if (myid == 0)
  { for (i=1; i<nprocs; i++)
    { MPI_Recv(&temp, 1, MPI_INT, i, 2, MPI_COMM_WORLD, &stat);
      mysum += temp;
    }
    aver = mysum/1024;
    for (i=1; i<nprocs; i++)
      MPI_Send(&aver, 1, MPI_INT, i, 3, MPI_COMM_WORLD);
  }
  else
  { MPI_Send(&mysum, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
    MPI_Recv(&aver, 1, MPI_INT, 0, 3, MPI_COMM_WORLD, &stat);
  }
  for (i=0; i<myrows; i++)
    for (k=0; k<1024; k++)
      if (arr3[i][k] < aver) arr4[i][k] = 0; else arr4[i][k] = arr3[i][k];
  // MPI_Gather(arr4, myrows*1024, MPI_INT, arr1, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD); is replaced by:
  if (myid == 0)
  { for (i=1; i<nprocs; i++)
      MPI_Recv(arr1[i*myrows], mysize, MPI_INT, i, 4, MPI_COMM_WORLD, &stat);
  }
  else
    MPI_Send(arr4, mysize, MPI_INT, 0, 4, MPI_COMM_WORLD);
  ...
}

Question 3 (Message Passing System): The figure below shows the timeline diagram of the execution of an MPI program with four processes.
(a) Calculate the speedup achieved, the efficiency, and the utilization factor of each process.
(b) List three reasons for achieving such a low performance. For each reason, propose a change that will improve the performance.

Answer 3a (Message Passing System): Calculate the speedup achieved, the efficiency and the utilization factor of each process.
Speedup = Computation Time / Parallel Time = (22+4+15+20+14)/54 = 75/54 = 1.39
Efficiency = Speedup / Number of processors = 1.39 / 4 = 0.35 = 35%
Utilization of P0 = 26/54 = 0.48 = 48%
Utilization of P1 = 15/54 = 0.28 = 28%
Utilization of P2 = 20/54 = 0.37 = 37%
Utilization of P3 = 14/54 = 0.26 = 26%
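For reference, the quantities worked out above can be written compactly; this is a minimal formulation, with T_i denoting the useful computation time of process P_i, T_par the total parallel execution time and p the number of processes (symbol names chosen here for illustration):

S = \frac{\sum_{i=0}^{p-1} T_i}{T_{par}}, \qquad E = \frac{S}{p}, \qquad U_i = \frac{T_i}{T_{par}}

With the values from the timeline, \sum_i T_i = 75, T_par = 54 and p = 4, which gives the figures listed above.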

Answer 3b (Message Passing System): List three reasons for achieving such a low performance. For each reason, propose a change that will improve the performance.
1. Bad load balancing (P0 useful work = 26, P3 useful work = 14). Improve the speedup by improving the load balancing: P0 must be assigned less computation than the rest, since it has to spend time distributing the data and then collecting the results.
2. Process P0 handles most of the communication using point-to-point communication (see next slide). Use collective communications to obtain a balanced communication load.
3. Processors are idle while waiting for a communication or synchronization event to be completed (see next slide). Hide the communication or synchronization latency by allowing the processor to do useful work while waiting for a long-latency event, using techniques such as non-blocking communication (pre-communication).

Hiding Communication Latency in MPI
In a message passing system, communication latency significantly reduces performance. MPI offers a number of functions that aim at hiding the communication latency and thus allow the processor to do useful work while the communication is in progress.
Non-blocking communication: MPI_Send and MPI_Recv are blocking communication functions; both nodes must wait until the data is transferred correctly. MPI supports non-blocking communication with the MPI_Isend() and MPI_Irecv() functions. With MPI_Isend() the transmitting node starts sending the data and continues; after executing a number of instructions it can check the status of the transfer and act accordingly. With MPI_Irecv() the receiving node posts the receive before the data is needed, expecting that the data will have arrived by the time it is needed.
Data packing: MPI supports data packing with the MPI_Pack() function, which packs different types of data into a single message. This reduces the communication overheads, since the number of messages is reduced.
Collective communication: MPI supports collective communications that can take advantage of the network facilities. For example, MPI_Bcast() can take advantage of the broadcast capability of an Ethernet network and thus send data to many nodes with a single message.
The speedup of a parallel processing system with N processors is said to be linear if it is equal to N. Speedup is reduced by two factors: overheads and latencies. Overheads can be classified as parallelism overheads caused by process management, communication overheads caused by processors exchanging information, synchronization overheads caused by synchronization events, and load imbalance overheads caused when some processors are busy while others are idle. Latency is the time that a processor is idle waiting for an event to be completed. Remote memory references, communication and synchronization introduce latencies in the execution of a program that reduce the overall throughput of a parallel processing system. The latency problem will get worse, since processor speed increases at a much higher rate than the speed of memory and the interconnection network. Furthermore, latency grows as the number of processing elements increases, since the communication-to-computation ratio increases.
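As an illustration of the non-blocking calls described above, the following sketch overlaps communication with local computation and then waits for completion before the buffers are reused. It is a minimal example, not code from the course slides; the buffer size, the message tag and the do_useful_work() helper are made up for illustration.

#include <mpi.h>
#include <stdio.h>

#define N 1024                                   /* arbitrary buffer size, chosen for illustration */

void do_useful_work(void) { /* computation that does not touch the message buffers */ }

int main(int argc, char *argv[])
{   int myid, nprocs, i;
    int sendbuf[N], recvbuf[N];
    MPI_Request req;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (nprocs >= 2) {
        if (myid == 0) {
            for (i = 0; i < N; i++) sendbuf[i] = i;
            MPI_Isend(sendbuf, N, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);  /* start the send and continue */
            do_useful_work();                    /* overlap computation with the transfer */
            MPI_Wait(&req, &stat);               /* sendbuf may be reused only after this completes */
        } else if (myid == 1) {
            MPI_Irecv(recvbuf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);  /* post the receive early */
            do_useful_work();                    /* overlap computation with the transfer */
            MPI_Wait(&req, &stat);               /* the data is guaranteed to be in recvbuf here */
            printf("Process 1 received %d ... %d\n", recvbuf[0], recvbuf[N-1]);
        }
    }
    MPI_Finalize();
    return 0;
}

MPI_Test(&req, &flag, &stat) can be used instead of MPI_Wait() to check whether the transfer has completed without blocking, which matches the "check the status and act accordingly" step described above.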