KAUST Winter Enhancement Program 2010 (WE 244)
MPI and OpenMP
Craig C. Douglas
School of Energy Resources, Department of Mathematics
University of Wyoming
What is MPI?
  MPI: Message Passing Interface
  MPI is not a new programming language, but a library of functions that can be called from C/C++/Fortran/Python
  Successor to PVM (Parallel Virtual Machine)
  Developed by an open, international forum with representation from industry, academia, and government laboratories
What Is It Good For?
  Allows data to be passed between processes in a distributed memory environment
  Provides source-code portability
  Allows efficient implementation
  Offers a great deal of functionality
  Supports heterogeneous parallel architectures
MPI Communicator
  Idea: a group of processes that are allowed to communicate with each other
  Most functions take a communicator as an argument
  Predefined communicator: MPI_COMM_WORLD
  Note the MPI naming format: constants are written MPI_XXX (all capitals), functions are written MPI_Xxx
    var = MPI_Xxx(parameters);
    MPI_Xxx(parameters);
Getting Started
  Include MPI header file
  Initialize MPI environment
  Work: make message passing calls
    Send
    Receive
  Terminate MPI environment
Include MPI header file
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    int main(int argc, char** argv){
      …
    }
Initialize MPI environment
    int main(int argc, char** argv){
      int numtasks, rank;
      MPI_Init(&argc, &argv);   /* pass the addresses of argc and argv */
      MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      ...
    }
Initialize MPI (cont.)
  MPI_Init(&argc, &argv)
    No MPI functions may be called before this call.
  MPI_Comm_size(MPI_COMM_WORLD, &nump)
    Stores in nump the number of processes in the communicator.
    A communicator is a collection of processes that can send messages to each other.
    MPI_COMM_WORLD is a predefined communicator that consists of all the processes running when program execution begins.
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank)
    Lets a process find out its rank (its identification number) within the communicator.
Terminate MPI environment
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    int main(int argc, char** argv){
      …
      MPI_Finalize();
    }
  No MPI functions may be called after this call.
Make message passing calls (Send, Receive)
  Let's work with MPI
  Work: make message passing calls (Send, Receive)
    if(my_rank != 0){
      MPI_Send(data, strlen(data)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
    else{
      MPI_Recv(data, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
    }
Work (cont.)
    int MPI_Send(void* message, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    int MPI_Recv(void* message, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status* status)
Hello World!!
    #include <stdio.h>    /* sprintf, printf */
    #include <string.h>   /* strlen */
    #include "mpi.h"
    int main(int argc, char* argv[]) {
      int my_rank, p, source, dest, tag = 0;
      char message[100];
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      if (my_rank != 0) {
        /* Create message and send it to process 0 */
        sprintf(message, "Hello from process %d!", my_rank);
        dest = 0;
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
      } else {
        /* Process 0 receives one message from each other process, in rank order */
        for (source = 1; source < p; source++) {
          MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
          printf("%s\n", message);
        }
      }
      MPI_Finalize();
      return 0;
    }
Compile and Run MPI
  Compile
    mpicc -o hello.exe mpi_hello.c
  Run
    mpirun -np 5 hello.exe
  Output
    $ mpirun -np 5 hello.exe
    Hello from process 1!
    Hello from process 2!
    Hello from process 3!
    Hello from process 4!
More MPI Functions
  MPI_Bcast(void* m, int s, MPI_Datatype dt, int root, MPI_Comm comm)
    Sends a copy of the data in m (s items of type dt) on the process with rank root to each process in the communicator.
  MPI_Reduce(void* operand, void* result, int count, MPI_Datatype datatype, MPI_Op operator, int root, MPI_Comm comm)
    Combines the operands stored in the memory referenced by operand on each process using operation operator and stores the result in result on process root.
  double MPI_Wtime(void)
    Returns a double precision value that represents the number of seconds that have elapsed since some point in the past.
  MPI_Barrier(MPI_Comm comm)
    Each process in comm blocks until every process in comm has called it.
More Examples
  Trapezoidal Rule: Compute Pi
    Integral from a to b of a nonnegative function f(x)
    Approach: estimate the area by partitioning the region into regular geometric shapes and then adding the areas of the shapes
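Written out for the pi example, and assuming (as the program on the following slides does) that f is evaluated at the midpoint of each of N equal subintervals, so that the shapes are rectangles of width h:

```latex
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; h \sum_{i=1}^{N} f(x_i),
\qquad f(x) = \frac{4}{1+x^2}, \quad h = \frac{1}{N}, \quad x_i = h\left(i - \tfrac{1}{2}\right)
```

With p processes, the process with rank r sums the terms i = r+1, r+1+p, r+1+2p, ... and the partial sums are added together at the end.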
Compute PI
    #include <stdio.h>
    #include <math.h>     /* fabs */
    #include "mpi.h"
    #define PI 3.141592653589793238462643
    #define PI_STR "3.141592653589793238462643"
    #define MAXLEN 40
    #define f(x) (4./(1.+ (x)*(x)))
    int main(int argc, char *argv[]){
      int N=0,rank,nprocrs,i,answer=1;
      double mypi,pi,h,sum,x,starttime,endtime,runtime,runtime_max;
      char buff[MAXLEN];
      MPI_Init(&argc,&argv);
      MPI_Comm_rank(MPI_COMM_WORLD,&rank);
      printf("CPU %d saying hello\n",rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);
      if(rank==0) printf("Using a total of %d CPUs\n",nprocrs);
Compute PI (cont.)
      while(answer){
        if(rank==0){
          printf("This program computes pi as "
                 "4.*Integral{0->1}[1/(1+x^2)]\n");
          printf("(Using PI = %s)\n",PI_STR);
          printf("Input the Number of intervals: N = ");
          fflush(stdout);
          fgets(buff,MAXLEN,stdin);
          sscanf(buff,"%d",&N);
          printf("pi will be computed with %d intervals on %d processors.\n", N, nprocrs);
        }
        /* Procr 0 = P(0) gives N to all other processors */
        MPI_Bcast(&N,1,MPI_INT,0,MPI_COMM_WORLD);
        if(N<=0) goto end_program;
Compute PI (cont.)
        starttime=MPI_Wtime();
        sum=0.0;
        h=1./N;
        /* each process takes every nprocrs-th interval and evaluates f at its midpoint */
        for(i=1+rank;i<=N;i+=nprocrs){
          x=h*(i-0.5);
          sum+=f(x);
        }
        mypi=sum*h;
        endtime=MPI_Wtime();
        runtime=endtime-starttime;
        /* P(0) collects the sum of the partial results and the longest runtime */
        MPI_Reduce(&mypi,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
        MPI_Reduce(&runtime,&runtime_max,1,MPI_DOUBLE,MPI_MAX,0,MPI_COMM_WORLD);
        printf("Procr %d: runtime = %f\n",rank,runtime);
        fflush(stdout);
        if(rank==0){
          printf("For %d intervals, pi = %.14lf, error = %g\n",N,pi,fabs(pi-PI));
Compute PI (cont.)
          printf("computed in = %f secs\n",runtime_max);
          fflush(stdout);
          printf("Do you wish to try another run? (y=1;n=0) ");
          fflush(stdout);
          fgets(buff,MAXLEN,stdin);
          sscanf(buff,"%d",&answer);
        }
        /* processors wait while P(0) gets new input from user */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Bcast(&answer,1,MPI_INT,0,MPI_COMM_WORLD);
        if(!answer) break;
      }   /* end of while(answer) */
    end_program:
      printf("\nProcr %d: Saying good-bye!\n",rank);
      if(rank==0) printf("\nEND PROGRAM\n");
      MPI_Finalize();
    }
Compile and Run Example 2
    mpicc -o pi.exe pi.c
    $ mpirun -np 2 pi.exe
    Procr 1 saying hello.
    Procr 0 saying hello
    Using a total of 2 CPUs
    This program computes pi as 4.*Integral{0->1}[1/(1+x^2)]
    (Using PI = 3.141592653589793238462643)
    Input the Number of intervals: N = 10
    pi will be computed with 10 intervals on 2 processors
    Procr 0: runtime = 0.000003
    Procr 1: runtime = 0.000003
    For 10 intervals, pi = 3.14242598500110, error = 0.000833331
    computed in = 0.000003 secs
OpenMP
  What does OpenMP stand for?
    Open specifications for Multi Processing
  It is an API with three main components
    Compiler directives
    Library routines
    Environment variables
  Used for writing multithreaded programs in shared memory environments
What do you need?
  What programming languages?
    C and C++
    FORTRAN (77, 90, 95)
  What operating systems?
    UNIX based ones
    Windows
  Can I compile OpenMP code with gcc?
    Yes: gcc -o pgm.exe -fopenmp pgm.c
Some compilers for OpenMP
  Free Software Foundation (GNU)
  Intel
  Portland Group Compilers and Tools
  IBM XL
  SGI MIPSpro
  Sun Studio 10
  Absoft Pro FortranMP
What It Does
  Program starts off with a master thread
  It runs for some amount of time
  When the master thread reaches a region where the work can be done concurrently
    It creates several threads
    They all do work in this region
  When the end of the region is reached
    All of the extra threads terminate
    The master thread continues
Example
  You (master thread) get a job moving boxes
  When you go to work
    You bring several "friends" (sub-threads) who help you move the boxes
  On pay day
    You do not bring any friends and you get all of the money
OpenMP directives
  Format example
    #pragma omp parallel for shared(y)
  Always starts with #pragma omp
  Then the directive name: parallel for
  Followed by a clause (the clause is optional): shared(y)
  At the end a newline
Directives list
  PARALLEL
    Multiple threads will execute the code in the block
  DO/for
    Causes the iterations of the do or for loop to be divided among the threads and executed in parallel
  SECTIONS
    Each section in the block is executed once, by one of the threads; different sections can run concurrently
  SINGLE
    The block is executed by only one thread
  PARALLEL DO/for
    A parallel region that contains only one DO/for loop
  PARALLEL SECTIONS
    A parallel region that contains only one SECTIONS construct
Work Sharing
Data scope attribute clauses
  PRIVATE
    Each thread gets its own, uninitialized copy of the listed variables
  SHARED
    The listed variables are shared by all threads
  DEFAULT
    Sets the default scope for all variables in the block
  FIRSTPRIVATE
    PRIVATE, with each copy initialized from the value of the original variable
  LASTPRIVATE
    PRIVATE, with the value from the last loop iteration copied back to the original variable
  COPYIN
    Copies the master thread's value of a THREADPRIVATE variable to the copy of every thread
  REDUCTION
    Each thread works on a private copy of the listed variable; at the end the copies are combined with the given operator into the shared variable
Directives and clauses
Synchronization
  MASTER
    Only the master thread executes this block
  CRITICAL
    Only one thread can execute this block at a time
  BARRIER
    All of the threads wait at this point until every thread has reached it
  ATOMIC
    The memory location is updated by one thread at a time
  FLUSH
    Makes the thread's view of memory consistent with main memory at this point
  ORDERED
    The enclosed part of the loop is executed in the same order as in a serial loop
Environment Variables
  OMP_SCHEDULE
    Schedule type and chunk size used by loops declared with schedule(runtime)
  OMP_NUM_THREADS
    Number of threads
  OMP_DYNAMIC
    Whether the number of threads may be adjusted dynamically
  OMP_NESTED
    Whether nested parallelism is allowed
Library Routines (the names are lowercase in C/C++, e.g. omp_get_thread_num)
  OMP_SET_NUM_THREADS, OMP_GET_NUM_THREADS, OMP_GET_MAX_THREADS, OMP_GET_THREAD_NUM
  OMP_GET_NUM_PROCS, OMP_IN_PARALLEL
  OMP_SET_DYNAMIC, OMP_GET_DYNAMIC
  OMP_SET_NESTED, OMP_GET_NESTED
  OMP_INIT_LOCK, OMP_DESTROY_LOCK, OMP_SET_LOCK, OMP_UNSET_LOCK, OMP_TEST_LOCK
Example
  http://beowulf.lcs.mit.edu/18.337/beowulf.html
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>   /* malloc, free */
    #define N 16384
    #define M 10
    double dotproduct(int, double *);
    /* Row i of A times x, with A(i,j) = 1/((i+j)(i+j+1)/2 + i+1) */
    double dotproduct(int i, double *x) {
      double temp=0.0, denom;
      int j;
      for (j=0; j<N; j++) {   // zero based!!
        denom = (i+j)*(i+j+1)/2 + i+1;
        temp = temp + x[j]*(1/denom);
      }
      return temp;
    }
    int main() {
      double *x = malloc(N*sizeof(double));
      double *y = malloc(N*sizeof(double));
      double eig = sqrt(N);
      int i,k;
      for (i=0; i<N; i++) { x[i] = 1/eig; }
      for (k=0;k<M;k++) {
        // compute y = Ax
        #pragma omp parallel for shared(y)
        for (i=0; i<N; i++) { y[i] = dotproduct(i,x); }
        // estimate the largest eigenvalue as the norm of y
        eig = 0;
        for (i=0; i<N; i++) { eig = eig + y[i]*y[i]; }
        eig = sqrt(eig);
        printf("The largest eigenvalue after %2d iteration is %16.15e\n",k+1, eig);
        // normalize
        for (i=0; i<N; i++) { x[i] = y[i]/eig; }
      }
      free(x);
      free(y);
      return 0;
    }
OpenMP References
  Book: Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost, Ruud van der Pas, and David Kuck
  https://computing.llnl.gov/tutorials/openMP
  http://openmp.org/wp/resources/#Tutorials
  http://beowulf.lcs.mit.edu/18.337/beowulf.html
  http://www.compunity.org/resources/compilers/index.php
MPI References
  Book: Parallel Programming with MPI, Peter Pacheco
  https://computing.llnl.gov/tutorials/mpi
  http://www-unix.mcs.anl.gov/mpi
  www.openmpi.org
  http://alliance.osc.edu/impi/
  http://rocs.acomp.usf.edu/tut/mpi.php
  http://www.lam-mpi.org/tutorials/nd