MPI and OpenMP By: Jesus Caban and Matt McKnight
What is MPI?
MPI: Message Passing Interface
–Not a new programming language, but a library of functions that can be called from C, Fortran, or Python
–Successor to PVM (Parallel Virtual Machine)
–Developed by an open, international forum with representation from industry, academia, and government laboratories
What is it for?
Allows data to be passed between processes in a distributed-memory environment
Provides source-code portability
Allows efficient implementation
Offers a great deal of functionality
Supports heterogeneous parallel architectures
MPI Communicator
Idea:
–Group of processes that are allowed to communicate with each other
Most often used communicator:
–MPI_COMM_WORLD
Note MPI format:
–Constants and types are written MPI_XXX, functions MPI_Xxx
–MPI_XXX var = MPI_Xxx(parameters);
–MPI_Xxx(parameters);
Getting Started
Include MPI header file
Initialize MPI environment
Work: Make message passing calls
–Send
–Receive
Terminate MPI environment
Include File
Include Initialize Work Terminate
Include MPI header file
#include <mpi.h>
int main(int argc, char** argv){
    …
}
Initialize MPI
Initialize MPI environment
int main(int argc, char** argv){
    int numtasks, rank;
    /* MPI_Init takes the addresses of argc and argv */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ...
}
Initialize MPI (cont.)
MPI_Init(&argc, &argv)
–No MPI functions may be called before this call.
MPI_Comm_size(MPI_COMM_WORLD, &nump)
–A communicator is a collection of processes that can send messages to each other.
–MPI_COMM_WORLD is a predefined communicator that consists of all the processes running when program execution begins.
–Stores the number of processes in the communicator in nump.
MPI_Comm_rank(MPI_COMM_WORLD, &myrank)
–Lets a process find out its own rank within the communicator.
Terminate MPI environment
#include <mpi.h>
int main(int argc, char** argv){
    …
    MPI_Finalize();
}
No MPI functions may be called after this call.
Let’s work with MPI
Work: Make message passing calls (Send, Receive)
if(my_rank != 0){
    MPI_Send(data, strlen(data)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
} else{
    MPI_Recv(data, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
}
Work (cont.)
int MPI_Send(void* message, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void* message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Hello World!!
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
    int my_rank, p, source, dest, tag = 0;
    char message[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* Create message */
        sprintf(message, "Hello from process %d!", my_rank);
        dest = 0;
        /* strlen + 1 so the terminating '\0' is sent as well */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    } else {
        /* Rank 0 receives from ranks 1..p-1 in order and prints each message */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}
Compile and Run MPI
Compile
–gcc -o hello.exe mpi_hello.c -lmpi
–mpicc -o hello.exe mpi_hello.c
Run
–mpirun -np 5 hello.exe
Output
$ mpirun -np 5 hello.exe
Hello from process 1!
Hello from process 2!
Hello from process 3!
Hello from process 4!
More MPI Functions
MPI_Bcast(void *m, int s, MPI_Datatype dt, int root, MPI_Comm comm)
–Sends a copy of the s data items in m on the process with rank root to every process in the communicator.
MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op operator, int root, MPI_Comm comm)
–Combines the operands stored in the memory referenced by operand on each process using the operation operator and stores the result in result on process root.
double MPI_Wtime(void)
–Returns a double-precision value that represents the number of seconds that have elapsed since some point in the past.
MPI_Barrier(MPI_Comm comm)
–Each process in comm blocks until every process in comm has called it.
More Examples
Trapezoidal Rule:
–Integral from a to b of a nonnegative function f(x)
–Approach: estimate the area by partitioning the region into regular geometric shapes and then adding the areas of the shapes
Compute Pi
Compute PI
#include <stdio.h>
#include <math.h>
#include "mpi.h"

#define PI 3.14159265358979323846
#define PI_STR "3.14159265358979323846"
#define MAXLEN 40
#define f(x) (4./(1.+ (x)*(x)))

int main(int argc, char *argv[]){
    int N=0, rank, nprocrs, i, answer=1;
    double mypi, pi, h, sum, x, starttime, endtime, runtime, runtime_max;
    char buff[MAXLEN];

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    printf("CPU %d saying hello\n",rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);
    if(rank==0) printf("Using a total of %d CPUs\n",nprocrs);
Compute PI (cont.)
    while(answer){
        if(rank==0){
            printf("This program computes pi as "
                   "4.*Integral{0->1}[1/(1+x^2)]\n");
            printf("(Using PI = %s)\n",PI_STR);
            printf("Input the Number of intervals: N = ");
            fgets(buff,MAXLEN,stdin);
            sscanf(buff,"%d",&N);
            printf("pi will be computed with %d intervals on %d processors.\n", N,nprocrs);
        }
        /* Procr 0 = P(0) gives N to all other processors */
        MPI_Bcast(&N,1,MPI_INT,0,MPI_COMM_WORLD);
        if(N<=0) goto end_program;
Compute PI (cont.)
        starttime=MPI_Wtime();
        sum=0.0;
        h=1./N;
        /* Midpoint rule; intervals are dealt out cyclically:
           rank+1, rank+1+nprocrs, rank+1+2*nprocrs, ... */
        for(i=1+rank;i<=N;i+=nprocrs){
            x=h*(i-0.5);
            sum+=f(x);
        }
        mypi=sum*h;
        endtime=MPI_Wtime();
        runtime=endtime-starttime;
        /* Sum the partial results and take the slowest runtime, both on rank 0 */
        MPI_Reduce(&mypi,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
        MPI_Reduce(&runtime,&runtime_max,1,MPI_DOUBLE,MPI_MAX,0,MPI_COMM_WORLD);
        printf("Procr %d: runtime = %f\n",rank,runtime);
        fflush(stdout);
        if(rank==0){
            printf("For %d intervals, pi = %.14lf, error = %g\n",N,pi,fabs(pi-PI));
Compute PI (cont.)
            printf("computed in = %f secs\n",runtime_max);
            fflush(stdout);
            printf("Do you wish to try another run? (y=1;n=0) ");
            fgets(buff,MAXLEN,stdin);
            sscanf(buff,"%d",&answer);
        }
        /* processors wait while P(0) gets new input from user */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Bcast(&answer,1,MPI_INT,0,MPI_COMM_WORLD);
        if(!answer) break;
    }
end_program:
    printf("\nProcr %d: Saying good-bye!\n",rank);
    if(rank==0) printf("\nEND PROGRAM\n");
    MPI_Finalize();
}
Compile and Run Example 2
Compile
–gcc -o pi.exe pi.c -lmpi -lm
Run
$ mpirun -np 2 pi.exe
Procr 1 saying hello.
Procr 0 saying hello
Using a total of 2 CPUs
This program computes pi as 4.*Integral{0->1}[1/(1+x^2)]
(Using PI = )
Input the Number of intervals: N = 10
pi will be computed with 10 intervals on 2 processors
Procr 0: runtime =
Procr 1: runtime =
For 10 intervals, pi = , error =
computed in = secs
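What is OpenMP?
Similar in purpose to MPI, but used for shared-memory parallelism
Simple set of directives
Incremental parallelism
Requires compiler support: when these slides were written, only proprietary compilers offered it (GCC has supported OpenMP since version 4.2)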
Compilers and Platforms
Fujitsu/Lahey Fortran, C and C++
–Intel Linux Systems
–Sun Solaris Systems
HP HP-UX PA-RISC/Itanium
–Fortran
–C
–aC++
HP Tru64 Unix
–Fortran
–C
–C++
IBM XL Fortran and C from IBM
–IBM AIX Systems
Intel C++ and Fortran Compilers from Intel
–Intel IA32 Linux Systems
–Intel IA32 Windows Systems
–Intel Itanium-based Linux Systems
–Intel Itanium-based Windows Systems
Guide Fortran and C/C++ from Intel's KAI Software Lab
–Intel Linux Systems
–Intel Windows Systems
PGF77 and PGF90 Compilers from The Portland Group, Inc. (PGI)
–Intel Linux Systems
–Intel Solaris Systems
–Intel Windows/NT Systems
SGI MIPSpro 7.4 Compilers
–SGI IRIX Systems
Sun Microsystems Sun ONE Studio 8, Compiler Collection, Fortran 95, C, and C++
–Sun Solaris Platforms
–Compiler Collection Portal
VAST from Veridian Pacific-Sierra Research
–IBM AIX Systems
–Intel IA32 Linux Systems
–Intel Windows/NT Systems
–SGI IRIX Systems
–Sun Solaris Systems
taken from
How do you use OpenMP?
–C/C++ API
Parallel Construct
–The fundamental construct that starts parallel execution: when a ‘region’ of the program can be executed by multiple threads in parallel, this construct creates the team of threads that runs it.
#pragma omp parallel [clause[ [, ]clause] …] new-line
    structured-block
Each clause is one of the following:
–if (scalar-expression)
–private (variable-list)
–firstprivate (variable-list)
–default (shared | none)
–shared (variable-list)
–copyin (variable-list)
–reduction (operator : variable-list)
–num_threads (integer-expression)
Fundamental Constructs
for Construct
–Defines an iterative work-sharing construct in which the iterations of the associated loop are divided among the threads of the team and executed in parallel.
sections Construct
–Identifies a noniterative work-sharing construct that specifies a set of structured blocks to be divided among the threads of the team; each section is executed exactly once, by one of the threads.
single Construct
–Specifies that the associated structured block is executed by only one thread of the team.
parallel for Construct
–Shortcut for a parallel region containing only a single for directive.
parallel sections Construct
–Shortcut for a parallel region containing only a single sections directive.
Master and Synchronization Directives
master Construct
–Specifies a structured block that is executed by the master thread of the team.
critical Construct
–Restricts execution of the associated structured block to a single thread at a time.
barrier Directive
–Synchronizes all threads in a team. When this directive is encountered, each thread waits until all the others have reached this point.
atomic Construct
–Ensures that a specific memory location is updated ‘atomically’ (meaning only one thread is allowed write access at a time).
flush Directive
–Specifies a “cross-thread” sequence point at which all threads in a team are ensured a “clean” view of certain objects in memory.
ordered Construct
–The structured block following this directive is executed in the order in which the iterations would run in a sequential loop.
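Data
How do we control the data in this SMP environment?
–threadprivate Directive makes file-scope and namespace-scope variables private to a thread
Data-Sharing Attributes
–private – private to each thread; the private copy is uninitialized on entry
–firstprivate – private, with each copy initialized from the original variable
–lastprivate – private, with the value from the sequentially last iteration or section copied back to the original variable
–shared – shared among all threads
–default – sets the attribute (shared or none) for variables not listed explicitly
–reduction – perform a reduction on scalars
–copyin – assign the same value to threadprivate variables
–copyprivate – broadcast the value of a private variable from one member of a team to the others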
Scalability test on SGI Origin 2000 Timing results of the dot product test in milliseconds for n = 16 *
Timing results of matrix times matrix test in milliseconds for n = 128
Architecture comparison From
References
Book: Parallel Programming with MPI, Peter Pacheco
www-unix.mcs.anl.gov/mpi