Introduction to Parallel Programming with MPI
Morris Law, SCID
Apr 25, 2018
Multi-core programming
Most current CPUs have multiple cores, which can be used easily by compiling with OpenMP support.
Programmers no longer need to rewrite sequential code; they add directives that instruct the compiler to parallelize the code with OpenMP.
Reference site: http://bisqwit.iki.fi/story/howto/openmp/
OpenMP example
    /*
     * Sample program to test runtime of simple matrix multiply
     * with and without OpenMP on gcc-4.3.3-tdm1 (mingw)
     * compile with gcc -fopenmp
     * (c) 2009, Rajorshi Biswas
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <assert.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int i, j, k;
        int n;
        double temp;
        double start, end, run;

        printf("Enter dimension ('N' for 'NxN' matrix) (100-2000): ");
        scanf("%d", &n);
        assert( n >= 100 && n <= 2000 );

        int **arr1 = malloc( sizeof(int*) * n);
        int **arr2 = malloc( sizeof(int*) * n);
        int **arr3 = malloc( sizeof(int*) * n);
        for(i=0; i<n; ++i) {
            arr1[i] = malloc( sizeof(int) * n );
            arr2[i] = malloc( sizeof(int) * n );
            arr3[i] = malloc( sizeof(int) * n );
        }

        printf("Populating array with random values...\n");
        srand( time(NULL) );
        for(i=0; i<n; ++i) {
            for(j=0; j<n; ++j) {
                arr1[i][j] = (rand() % n);
                arr2[i][j] = (rand() % n);
            }
        }
        printf("Completed array init.\n");

        /* Sequential multiply */
        printf("Crunching without OMP...");
        fflush(stdout);
        start = omp_get_wtime();
        for(i=0; i<n; ++i) {
            for(j=0; j<n; ++j) {
                temp = 0;
                for(k=0; k<n; ++k) {
                    temp += arr1[i][k] * arr2[k][j];
                }
                arr3[i][j] = temp;
            }
        }
        end = omp_get_wtime();
        printf(" took %f seconds.\n", end-start);

        /* Same multiply, outer loop shared among threads */
        printf("Crunching with OMP...");
        fflush(stdout);
        start = omp_get_wtime();
        #pragma omp parallel for private(i, j, k, temp)
        for(i=0; i<n; ++i) {
            for(j=0; j<n; ++j) {
                temp = 0;
                for(k=0; k<n; ++k) {
                    temp += arr1[i][k] * arr2[k][j];
                }
                arr3[i][j] = temp;
            }
        }
        end = omp_get_wtime();
        printf(" took %f seconds.\n", end-start);

        return 0;
    }
Compiling for OpenMP support
GCC
    gcc -fopenmp -o foo foo.c
    gfortran -fopenmp -o foo foo.f
Intel Compiler (recent versions spell the flag -qopenmp)
    icc -openmp -o foo foo.c
    ifort -openmp -o foo foo.f
PGI Compiler
    pgcc -mp -o foo foo.c
    pgf90 -mp -o foo foo.f
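As a smaller illustration of the directive approach, the sketch below adds a single OpenMP directive to a sequential loop that estimates pi by numerical integration. The function name estimate_pi is an illustrative choice and is not part of the lab files. When compiled without OpenMP support the pragma is ignored and the loop simply runs sequentially.

```c
/* Estimate pi with the midpoint rule on 4/(1+x^2) over [0,1].
 * estimate_pi is a hypothetical helper name used only for this sketch. */
double estimate_pi(long num_steps)
{
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum; reduction(+:sum)
     * adds the partial sums together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Calling estimate_pi(10000000) from a main function and printing the result with %f should give a value close to 3.141593, with or without the OpenMP flag.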
What is Message Passing Interface (MPI)?
A portable standard for communication
Processes communicate through messages
Each process is a separate program
All data is private to its process
What is Message Passing Interface (MPI)?
MPI is a library, not a language!
It works with different compilers, but every part of a program must be built against the same MPI implementation, e.g. MPICH, LAM/MPI or Open MPI.
Programs are written in a standard sequential language: Fortran, C, C++, etc.
Basic Idea of Message Passing Interface (MPI)
MPI environment
    Initialize, manage, and terminate communication among processes
Communication between processes
    Point-to-point communication, e.g. send, receive
    Collective communication, e.g. broadcast, gather
Complicated data structures
    Communicate data such as matrices and memory regions effectively
Message Passing Model
[Figure: a serial process compared with message passing, where several processes (Process 0, ...) run along a time axis and exchange data via the interconnection]
General MPI program structure
    MPI include file
    Variable declarations
    Initialize MPI environment
    Do work and make message passing calls
    Terminate MPI environment

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int np, rank, ierr;

        ierr = MPI_Init(&argc, &argv);          /* initialize MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &np);     /* total number of processes */
        /* Do some work */
        printf("Helloworld, I'm P%d of %d\n", rank, np);
        ierr = MPI_Finalize();                  /* terminate MPI environment */
        return 0;
    }

Output with 3 processes (the order of the lines may vary):
    Helloworld, I'm P0 of 3
    Helloworld, I'm P1 of 3
    Helloworld, I'm P2 of 3
When to use MPI?
You need a portable parallel program
You are writing a parallel library
You care about performance
You have a problem that can be solved in parallel
F77/F90, C/C++ MPI library calls
Fortran 77/90 uses subroutines
    CALL is used to invoke the library call
    Nothing is returned; the error code variable is the last argument
    All variables are passed by reference
C/C++ uses functions
    Just the name is used to invoke the library call
    The function returns an integer value (an error code)
    Variables are passed by value, unless otherwise specified
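The C convention of returning an error code can be sketched as follows. This is a minimal example, not taken from the lab files. It assumes an MPI-2 or later library, because by default MPI aborts on error and the error handler must be switched to MPI_ERRORS_RETURN before a return code can be inspected.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, ierr, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* By default MPI aborts on error; ask it to return error codes instead. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* The function's return value is the error code. */
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ierr != MPI_SUCCESS) {
        MPI_Error_string(ierr, msg, &len);
        fprintf(stderr, "MPI_Comm_rank failed: %s\n", msg);
        MPI_Abort(MPI_COMM_WORLD, ierr);
    }
    printf("P%d: MPI_Comm_rank returned MPI_SUCCESS\n", rank);

    MPI_Finalize();
    return 0;
}
```

In Fortran the same check would test the ierr variable passed as the last argument of each call.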
Types of Communication
Point-to-point communication
    Communication involving only two processes
Collective communication
    Communication that involves a group of processes
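A minimal sketch of point-to-point communication, assuming at least two processes: process 0 sends one integer to process 1. The value 42 and the tag 0 are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, value;
    const int tag = 0;          /* arbitrary message tag chosen for this sketch */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "Run with at least 2 processes\n");
    } else if (rank == 0) {
        value = 42;
        /* buffer, count, datatype, destination rank, tag, communicator */
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        printf("P0 sent %d to P1\n", value);
    } else if (rank == 1) {
        /* buffer, count, datatype, source rank, tag, communicator, status */
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("P1 received %d from P0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Processes with rank 2 or higher take no part in the exchange. A collective call such as MPI_Bcast or MPI_Reduce, by contrast, must be made by every process in the communicator.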
Implementation of MPI
Browse the sample files
The sample zip file, mpi-1.zip, has been stored in your home directory for the laboratory. Please unzip the file:
    unzip mpi-1.zip
There should be 5 subdirectories inside mpi-1:
    ls -l mpi-1
    total 20
    drwxr-xr-x. 2 morris dean 4096 Nov 27 09:55 00openmp
    drwxrwxr-x. 2 morris dean 4096 Nov 27 09:41 0-hello
    drwxrwxr-x. 2 morris dean 4096 Nov 27 09:51 array-pi
    drwxrwxr-x. 2 morris dean 4096 Nov 27 09:49 mc-pi
    drwxrwxr-x. 2 morris dean 4096 Nov 27 09:51 series-pi
The 5 subdirectories store sample mpi-1 programs with README files for this laboratory.
First MPI C program: hello1.c
Change directory to 0-hello and use an editor, e.g. nano, to open hello1.c
    cd mpi-1/0-hello
    nano hello1.c

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int version, subversion;

        MPI_Init(&argc, &argv);
        MPI_Get_version(&version, &subversion);
        printf("Hello world!\n");
        printf("Your MPI Version is: %d.%d\n", version, subversion);
        MPI_Finalize();
        return(0);
    }
First MPI Fortran program: hello1.f
In the same directory, use an editor to open hello1.f
    nano hello1.f

      program main
      include 'mpif.h'
      integer ierr, version, subversion
      call MPI_INIT(ierr)
      call MPI_GET_VERSION(version, subversion, ierr)
      print *, 'Hello world!'
      print *, 'Your MPI Version is: ', version, '.', subversion
      call MPI_FINALIZE(ierr)
      end
Second MPI C program: hello2.c
In the same directory, use an editor to open hello2.c
    nano hello2.c

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello world! I am P%d of %d\n", rank, size);
        MPI_Finalize();
        return(0);
    }
Second MPI Fortran program: hello2.f
In the same directory, use an editor to open hello2.f
    nano hello2.f

      program main
      include 'mpif.h'
      integer rank, size, ierr
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      print *, 'Hello world! I am P', rank, ' of ', size
      call MPI_FINALIZE(ierr)
      end
Make all files in hello
A 'Makefile' is written for each example directory. Running 'make' compiles all the hello examples:
    make
    /usr/lib64/mpich/bin/mpif77 -o helloF1 hello1.f
    /usr/lib64/mpich/bin/mpif77 -o helloF2 hello2.f
    /usr/lib64/mpich/bin/mpicc -o helloC1 hello1.c
    /usr/lib64/mpich/bin/mpicc -o helloC2 hello2.c
mpirun hello examples in foreground
You may run the hello examples in the foreground by giving mpirun the number of processes and a machinefile, e.g.
    mpirun -np 4 -machinefile machine ./helloC2
    Hello world! I am P0 of 4
    Hello world! I am P2 of 4
    Hello world! I am P3 of 4
    Hello world! I am P1 of 4
machine is a file listing the hostnames on which you want the program to run. The processes print in no fixed order.
Exercise
Following the hello example above, run helloC1, helloF1 and helloF2 in the foreground with mpirun on 4 processors.
Change directory to mc-pi and compile all programs inside using 'make'. Run mpi-mc-pi using 2, 4 and 8 processors.
Change directory to series-pi. Run series-pi using 2, 4, 6 and 8 processors. Note the time difference.
Parallelization example: serial-pi.c
    #include <stdio.h>

    static long num_steps = 10000000;
    double step;

    int main ()
    {
        int i;
        double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;
        for (i=0; i< num_steps; i++) {
            x = (i+0.5)*step;            /* midpoint of the i-th interval */
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
        printf("Est Pi= %f\n", pi);
        return 0;
    }
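The loop in serial-pi.c is the midpoint rule applied to a standard integral for pi. Writing N for num_steps and h for step, the quantity being computed is:

```latex
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
    \approx h \sum_{i=0}^{N-1} \frac{4}{1+x_i^2},
\qquad x_i = \left(i + \tfrac{1}{2}\right) h, \quad h = \frac{1}{N}.
```

The terms of the sum are independent of one another, which is what makes the loop easy to divide among processes.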
Parallelizing serial-pi.c into mpi-pi.c
Step 1: Adding MPI environment
    #include "mpi.h"
    #include <stdio.h>

    static long num_steps = 10000000;
    double step;

    int main (int argc, char *argv[])
    {
        int i;
        double x, pi, sum = 0.0;

        MPI_Init(&argc, &argv);
        step = 1.0/(double) num_steps;
        for (i=0; i< num_steps; i++) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
        printf("Est Pi= %f\n", pi);
        MPI_Finalize();
        return 0;
    }
Parallelizing serial-pi.c into mpi-pi.c
Step 2: Adding variables to print ranks
    #include "mpi.h"
    #include <stdio.h>

    static long num_steps = 10000000;
    double step;

    int main (int argc, char *argv[])
    {
        int i;
        double x, pi, sum = 0.0;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        step = 1.0/(double) num_steps;
        for (i=0; i< num_steps; i++) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
        printf("Est Pi= %f, Processor %d of %d \n", pi, rank, size);
        MPI_Finalize();
        return 0;
    }
Parallelizing serial-pi.c into mpi-pi.c
Step 3: Divide the workload
    #include "mpi.h"
    #include <stdio.h>

    static long num_steps = 10000000;
    double step;

    int main (int argc, char *argv[])
    {
        int i;
        double x, mypi, pi, sum = 0.0;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        step = 1.0/(double) num_steps;
        /* each process takes every size-th step, starting at its own rank */
        for (i=rank; i< num_steps; i+=size) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        mypi = step * sum;               /* partial result of this process only */
        printf("Est Pi= %f, Processor %d of %d \n", mypi, rank, size);
        MPI_Finalize();
        return 0;
    }
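The cyclic split in Step 3 can be checked without MPI. The sketch below is a serial model in which partial_pi computes what one rank would compute and combined_pi adds up the contributions of all ranks. Both names are hypothetical helpers and are not part of the lab files. Each partial value is roughly pi/size, so the value printed by each process in Step 3 is only a fraction of pi until the partial results are collected.

```c
/* Partial sum of the midpoint rule computed by one rank under a cyclic split:
 * rank r handles i = r, r+size, r+2*size, ...
 * partial_pi is a hypothetical helper; it makes no MPI calls. */
double partial_pi(int rank, int size, long num_steps)
{
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    long i;

    for (i = rank; i < num_steps; i += size) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}

/* Add up what every rank would compute; this stands in for the collection step. */
double combined_pi(int size, long num_steps)
{
    double pi = 0.0;
    int rank;

    for (rank = 0; rank < size; rank++)
        pi += partial_pi(rank, size, num_steps);
    return pi;
}
```

Because every index i is visited by exactly one rank, the combined value matches the serial estimate up to floating-point rounding.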
Parallelizing serial-pi.c into mpi-pi.c
Step 4: Collect partial results
    #include "mpi.h"
    #include <stdio.h>

    static long num_steps = 10000000;
    double step;

    int main (int argc, char *argv[])
    {
        int i;
        double x, mypi, pi, sum = 0.0;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        step = 1.0/(double) num_steps;
        for (i=rank; i< num_steps; i+=size) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        mypi = step * sum;
        /* sum every process's mypi into pi on process 0 */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank==0) printf("Est Pi= %f\n", pi);
        MPI_Finalize();
        return 0;
    }
Compile and run MPI program
    $ mpicc -o mpi-pi mpi-pi.c
    $ mpirun -np 4 -machinefile machines ./mpi-pi
Parallelization example 2: serial-mc-pi.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>

    int main(int argc, char *argv[])
    {
        long in, i, n;
        double x, y, q;
        time_t now;

        in = 0;
        srand(time(&now));
        printf("Input no of samples : ");
        scanf("%ld", &n);
        for (i=0; i<n; i++) {
            x = rand()/(RAND_MAX+1.0);
            y = rand()/(RAND_MAX+1.0);
            if ((x*x + y*y) < 1) in++;   /* point falls inside the quarter circle */
        }
        q = ((double)4.0)*in/n;
        printf("pi = %.20lf\n", q);
        printf("rmse = %.20lf\n", sqrt(( (double) q*(4-q))/n));
        return 0;
    }
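The rmse line follows from treating each sample as a Bernoulli trial. A short derivation, assuming independent uniform samples (which rand() only approximates), with "in" the number of points inside the quarter circle:

```latex
p = \Pr\left(x^2 + y^2 < 1\right) = \frac{\pi}{4}, \qquad
q = \frac{4\,\mathrm{in}}{n}, \qquad
\operatorname{Var}(q) = \frac{16\,p\,(1-p)}{n} = \frac{\pi\,(4-\pi)}{n}
\;\Rightarrow\;
\mathrm{rmse} \approx \sqrt{\frac{q\,(4-q)}{n}}.
```

The last step replaces the unknown pi with its estimate q, which is what the program prints.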
Parallelization example 2: mpi-mc-pi.c
    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>

    int main(int argc, char *argv[])
    {
        long in, i, n;
        double x, y, q, Q;
        time_t now;
        int rank, size;

        MPI_Init(&argc, &argv);
        in = 0;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        srand(time(&now)+rank);          /* different seed on each process */
        if (rank==0) {
            printf("Input no of samples : ");
            scanf("%ld", &n);
        }
        MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);
        for (i=0; i<n; i++) {
            x = rand()/(RAND_MAX+1.0);
            y = rand()/(RAND_MAX+1.0);
            if ((x*x + y*y) < 1) in++;
        }
        q = ((double)4.0)*in/n;
        MPI_Reduce(&q, &Q, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank==0) {
            Q = Q / size;                /* Q is defined only on the root */
            printf("pi = %.20lf\n", Q);
            printf("rmse = %.20lf\n", sqrt(( (double) Q*(4-Q))/n/size));
        }
        MPI_Finalize();
        return 0;
    }
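Since every process draws the same number of samples n, averaging the per-process estimates as mpi-mc-pi.c does gives the same number as pooling the hit counts over n*size samples. This is why the rmse is divided by n*size. The sketch below shows the pooled form as a serial model. pooled_pi is a hypothetical helper, and the hits array stands in for the counts that each rank would contribute.

```c
/* Combine per-rank hit counts into one estimate of pi.
 * hits[r] is the number of points rank r found inside the quarter circle,
 * n is the number of samples drawn by each rank.
 * pooled_pi is a hypothetical helper; it makes no MPI calls. */
double pooled_pi(const long *hits, int size, long n)
{
    long total = 0;
    int r;

    for (r = 0; r < size; r++)
        total += hits[r];
    return 4.0 * (double) total / ((double) n * size);
}
```

In an MPI program the loop over r would be replaced by a reduction of the integer counts with MPI_LONG and MPI_SUM.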
Compile and run mpi-mc-pi
    $ mpicc -o mpi-mc-pi mpi-mc-pi.c -lm
    $ mpirun -np 4 -machinefile machines ./mpi-mc-pi
Collective communication
scatter
    Distribute your data among the processes
reduction (e.g. PROD)
    Information from all processes is used to provide a condensed result by/for one process
[Figure: the values 1, 3, 5 and 7 held by four processes are reduced with PROD to 105]
MPI_Scatter
Distributes data from root to all tasks in a group (the root also receives its own share)

    int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                    void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                    int root, MPI_Comm comm)

Input Parameters
    sendbuf     address of send buffer (choice, significant only at root)
    sendcnt     number of elements sent to each process (integer, significant only at root)
    sendtype    data type of send buffer elements (handle, significant only at root)
    recvcnt     number of elements in receive buffer (integer)
    recvtype    data type of receive buffer elements (handle)
    root        rank of sending process (integer)
    comm        communicator (handle)
Output Parameter
    recvbuf     address of receive buffer (choice)
MPI_Scatter example
Example: two vectors are distributed in order to prepare a parallel computation of their scalar product

    MPI_Scatter(&a,1,MPI_INT,&m,1,MPI_INT,2,MPI_COMM_WORLD);

Before the call (root is processor 2):
    Processor   Data
    2           a[0]=10  a[1]=11  a[2]=12  a[3]=13
After the call:
    Processor   m
    0           10
    1           11
    2           12
    3           13
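The scalar product mentioned in the example can be modelled serially to show which elements each rank works on after a scatter. This sketch makes no MPI calls. local_dot and scattered_dot are hypothetical helpers, and the second vector b is an assumption added for illustration.

```c
/* Serial model of the data layout MPI_Scatter produces: with cnt elements
 * per rank, rank r receives elements r*cnt .. r*cnt+cnt-1 of the root's buffer.
 * local_dot and scattered_dot are hypothetical helpers; they make no MPI calls. */
long local_dot(const int *a, const int *b, int rank, int cnt)
{
    long s = 0;
    int k;

    for (k = 0; k < cnt; k++)
        s += (long) a[rank * cnt + k] * b[rank * cnt + k];
    return s;
}

/* Sum of the per-rank partial products, standing in for a reduction with MPI_SUM. */
long scattered_dot(const int *a, const int *b, int size, int cnt)
{
    long total = 0;
    int rank;

    for (rank = 0; rank < size; rank++)
        total += local_dot(a, b, rank, cnt);
    return total;
}
```

With a = {10, 11, 12, 13}, b = {1, 2, 3, 4}, four ranks and one element per rank, rank 2 contributes 12*3 = 36 and the total is 120.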
MPI_Reduce
Reduces values on all processes to a single value on root

    int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
                   MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

Input Parameters
    sendbuf     address of send buffer (choice)
    count       number of elements in send buffer (integer)
    datatype    data type of elements of send buffer (handle)
    op          reduce operation (handle)
    root        rank of root process (integer)
    comm        communicator (handle)
Output Parameter
    recvbuf     address of receive buffer (choice, significant only at root)
MPI_Reduce example
Example: calculation of the global minimum of the variables kept by all processes, calculation of a global sum, etc.

    MPI_Reduce(&b,&d,1,MPI_INT,MPI_SUM,2,MPI_COMM_WORLD);

[Figure: each processor holds variables a, b, c and d. After MPI_Reduce with op MPI_SUM, d on processor 2 (the root) holds the sum of b over all processors, 19 in the slide's example.]
MPI Datatype
    MPI Datatype          Corresponding datatype in C
    MPI_CHAR              signed char
    MPI_SHORT             signed short int
    MPI_INT               signed int
    MPI_LONG              signed long int
    MPI_UNSIGNED_CHAR     unsigned char
    MPI_UNSIGNED_SHORT    unsigned short int
    MPI_UNSIGNED          unsigned int
    MPI_UNSIGNED_LONG     unsigned long int
    MPI_FLOAT             float
    MPI_DOUBLE            double
    MPI_LONG_DOUBLE       long double
Thanks
Please give me some comments: https://goo.gl/forms/880NY3kZ9h7ay7r32
Morris Law, SCID (morris@hkbu.edu.hk)