Parallel Processing Javier Delgado Grid-Enabledment of Scientific Applications Professor S. Masoud Sadjadi
Outline Why parallel processing Overview The Message Passing Interface (MPI) Introduction Basics Examples OpenMP Alternatives to MPI
Why parallel processing? Computationally-intensive scientific applications Hurricane modelling Bioinformatics High-Energy Physics Physical limits of one processor There are many open areas in science that require massive computation power to solve. Many new areas have emerged recently, such as bioinformatics. As the professor has discussed in class, there are physical limitations to how fast a single processor can go. Even if there were not, we are still many years away from a single processor being able to solve these problems.
Types of Parallel Processing Shared Memory e.g. Multiprocessor computer Distributed Memory e.g. Compute Cluster
Shared Memory Advantages No explicit message passing Fast Disadvantages Scalability Synchronization Since all processors are on the same box, the user does not need to pass messages as in a distributed system. In many cases, a simple pragma statement will take care of everything, as in the sketch below. Also, since all load balancing is handled within a single machine, it is fast. However, as more and more processors/cores are added, simultaneous access to memory can lead to bus saturation. Also, synchronization becomes a problem if multiple cores are reading and writing the same area in memory. Source: http://kelvinscale.net
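A minimal sketch of what such a pragma looks like in practice (this example is not from the original slides; it assumes a C compiler with OpenMP support, e.g. gcc -fopenmp):

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;
    /* one pragma parallelizes the loop across the available cores
       and combines the per-thread partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000000; i++)
        sum += (double)i * (double)i;
    printf("sum = %f\n", sum);
    return 0;
}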
Distributed Memory Advantages Each processor has its own memory Usually more cost-effective Disadvantages More programmer involvement Slower
Combination of Both Emerging trend Best and worst of both worlds As processors themselves are scaling out instead of up, we end up with a combination of shared memory and distributed memory
Outline Why parallel processing Overview The Message Passing Interface (MPI) Introduction Basics Examples OpenMP Alternatives to MPI
Message Passing Standard for Distributed Memory systems Networked workstations can communicate De Facto specification: The Message Passing Interface (MPI) Free MPI Implementations: MPICH OpenMPI LAM-MPI “Specification” is highlighted since MPI is not really an implementation. It is a specification of what the implementations should do. There are several implementations available today.
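Whichever implementation is used, the typical workflow is the same (the commands below are the conventional ones; exact names and flags vary by implementation):

mpicc integration.c -o integration      (compile with the MPI wrapper compiler)
mpirun -np 4 ./integration              (launch 4 processes; some systems use mpiexec)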
MPI Basics Design Virtues Defines communication, but not its hardware Expressive Performance Concepts No adding/removing of processors during computation Same program runs on all processors Single-Program, Multiple Data (SPMD) Multiple Instruction, Multiple Data (MIMD) Processes identified by “rank” Process determines its role from program logic Master node is the entry point Core commands: Init, Send, Receive, Finalize MPI specifies the communication directives that are allowed, but it does not tie them to any particular hardware: the same program can run over Ethernet, Myrinet, or even on shared-memory systems (although by default this is disabled in most implementations, as far as I know). It is designed so that programs can be written with a minimal subset of the specified functions, yet many powerful functions are provided for optimal performance and programming power. Since it is an open standard, a lot of thought went into its design, and it is optimized for parallel programs. It also works with other compiler optimizations, since standard system compilers are used. The number of nodes doing computation stays constant; this makes for an easier implementation and is generally safe for jobs that complete in a reasonable amount of time, on servers, and not in a “dangerous” environment. One of the main problems of grid computing, which Marlon will cover in a later lecture, is that this assumption does not hold. MPI programs consist of a single executable that runs on all participating nodes. Sometimes the same instructions are carried out on different data; other times different instructions are carried out on the data. A minimal skeleton follows.
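A minimal SPMD skeleton (not from the original slides) showing the core calls; every process runs the same executable and branches on its rank:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are running */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I */
    if (rank == 0)
        printf("Master: %d processes total\n", size);
    else
        printf("Worker %d reporting\n", rank);
    MPI_Finalize();                         /* shut MPI down */
    return 0;
}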
Communication Types Standard Synchronous (blocking send) Ready Buffered (asynchronous) For non-blocking communication: MPI_Wait – block until complete MPI_Test – true/false At the heart of MPI is message passing, in other words sending (and receiving) messages. Here we begin to see the flexibility provided by MPI. It defines a standard communication type, which may be synchronous or asynchronous; the underlying implementation tries to make the best decision. Synchronous communication requires the call to block until a “receive” is posted by the destination node. Ready mode assumes that the destination node is ready, so it will complete even if the receiver was not ready, which could be dangerous. Buffered mode makes a copy of the message in a local buffer, so that it can execute when ready. With non-blocking calls, other work can be performed while the message is transferring. To effectively convert them to blocking calls, MPI_Wait (or one of its variants) can be used. To test whether a transfer has completed, MPI_Test (or one of its variants) may be used, as sketched below.
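A minimal sketch (not from the original slides) of non-blocking communication with MPI_Isend/MPI_Irecv plus MPI_Test and MPI_Wait; run with at least two processes:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, done = 0;
    double value = 3.14, received = 0.0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* non-blocking send to rank 1; the call returns immediately */
        MPI_Isend(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other useful work could go here while the message is in flight ... */
        MPI_Wait(&req, &status);           /* block until the send completes */
    } else if (rank == 1) {
        MPI_Irecv(&received, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Test(&req, &done, &status);    /* poll: sets done to true/false */
        if (!done)
            MPI_Wait(&req, &status);       /* block until the receive completes */
        printf("rank 1 received %f\n", received);
    }

    MPI_Finalize();
    return 0;
}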
Message Structure Send: data (variable name), data length, data type, destination, tag, communication context Recv: data, data length, data type, status, tag, communication context Naturally, for things like Send/Receive and Wait/Test to work, there needs to be a way of identifying messages. Various parameters related to the data being transferred must be specified. Also, in order to differentiate messages, a tag needs to be issued. Since tags are user-generated and collisions are possible, there is a need for contexts as well. Contexts are system-generated. An annotated example follows.
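To make the mapping concrete, here is a small assumed example (not from the original slides) showing how those fields appear as arguments of MPI_Send/MPI_Recv and in the status object; run with at least two processes:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /*       data,  length, type,    destination, tag, context (communicator) */
        MPI_Send(&data, 1,      MPI_INT, 1,           7,   MPI_COMM_WORLD);
    } else if (rank == 1) {
        /*       data,  length, type,    source, tag, context,        status */
        MPI_Recv(&data, 1,      MPI_INT, 0,      7,   MPI_COMM_WORLD, &status);
        /* the status object records who actually sent the message and its tag */
        printf("got %d from rank %d with tag %d\n",
               data, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}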
Data Types and Functions Uses its own types for consistency MPI_INT, MPI_CHAR, etc. All Functions prefixed with “MPI_” MPI_Init, MPI_Send, MPI_Recv, etc.
Our First Program: Numerical Integration Objective: Calculate the area under f(x) = x^2 Outline: Define variables Initialize MPI Determine subset of problem to calculate Perform Calculation Collect Information (at Master) Send Information (Slaves) Finalize Problem: Determine the area under the curve f(x) = x^2, between x = [2,5], using a 50-rectangle resolution
Our First Program Download Link: http://www.fiu.edu/~jdelga06/integration.c
Variable Declarations

#include "mpi.h"
#include <stdio.h>

/* problem parameters */
#define f(x) ((x) * (x))
#define numberRects 50
#define lowerLimit 2.0
#define upperLimit 5.0

int main( int argc, char * argv[] )
{
    /* MPI variables */
    int dest, noProcesses, processId, src, tag;
    MPI_Status status;

    /* problem variables */
    int i;
    double area, x, height, lower, width, total, range;
    ...
MPI Initialization

int main( int argc, char * argv[] )
{
    ...
    MPI_Init(&argc, &argv);                        /* start up MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);   /* number of participating processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);     /* this process's rank */

This is the same main; I'm just copying it again so you realize we are still in it.
Calculation

int main( int argc, char * argv[] )
{
    ...
    /* adjust problem size for subproblem */
    range = (upperLimit - lowerLimit) / noProcesses;
    width = range / numberRects;
    lower = lowerLimit + range * processId;

    /* calculate area for subproblem */
    area = 0.0;
    for (i = 0; i < numberRects; i++)
    {
        x = lower + i * width + width / 2.0;
        height = f(x);
        area = area + width * height;
    }
Sending and Receiving

int main( int argc, char * argv[] )
{
    ...
    tag = 0;
    if (processId == 0)    /* MASTER */
    {
        total = area;
        for (src = 1; src < noProcesses; src++)
        {
            MPI_Recv(&area, 1, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &status);
            total = total + area;
        }
        fprintf(stderr, "The area from %f to %f is: %f\n",
                lowerLimit, upperLimit, total);
    }
    else                   /* WORKER (i.e. compute node) */
    {
        dest = 0;
        MPI_Send(&area, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }

Using “0” as the destination is good since there will always be a processor with rank 0. If you are going to be testing code on a single-processor system, as is often the case, this is especially applicable.
Finalizing

int main( int argc, char * argv[] )
{
    ...
    MPI_Finalize();
    return 0;
}
Communicators MPI_COMM_WORLD – All processes involved What if different workers have different tasks? MPI_COMM_WORLD is the default. Simple example: one process acts as a random number generator that distributes unique numbers to the other nodes, while the MASTER node sends the compute tasks to the rest of the nodes. In this case, you could have a communicator called “WORKER”. When an MPI call is given WORKER as the communicator, only the “WORKER” processes will be involved. A sketch of creating such a communicator follows.
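A minimal sketch (not from the original slides) of creating a sub-communicator with MPI_Comm_split; the split of rank 0 versus everyone else is an assumption for illustration:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int worldRank, subRank, color;
    MPI_Comm workerComm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    /* color 0 = the generator (world rank 0), color 1 = the workers */
    color = (worldRank == 0) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, &workerComm);

    MPI_Comm_rank(workerComm, &subRank);
    printf("world rank %d has rank %d in its sub-communicator\n",
           worldRank, subRank);

    /* collective calls that are given workerComm involve only that group */
    MPI_Comm_free(&workerComm);
    MPI_Finalize();
    return 0;
}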
Additional Functions Data Management MPI_Bcast (broadcast) Collective Computation Min, Max, Sum, AND, etc. Benefits: Abstraction Optimized As mentioned earlier, many complete MPI programs can be created with the 6 basic functions. However, for optimal performance and development time, it is sometimes necessary to use other functions. These functions still use send and receive internally, but provide abstraction. Also, they are internally optimized for performance. I can't go over everything here, but these are a couple. A typical example is data management, and the most common one is the broadcast message. This is used to send something to all participating nodes, for example a constant variable. Another example is a collective computation function. For example, if you calculate different subsets of a problem at different nodes and need to get the sum of them all, a sum function is provided. A sketch using both follows. Source: http://www.pdc.kth.se
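A minimal sketch (not from the original slides) of MPI_Bcast and MPI_Reduce; in the integration example, MPI_Reduce with MPI_SUM could replace the master's manual receive loop:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double constant = 0.0, localArea, totalArea;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        constant = 2.5;                 /* value initially known only to the master */

    /* broadcast: after this call every process has the master's value */
    MPI_Bcast(&constant, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    localArea = constant * (rank + 1);  /* stand-in for a per-process partial result */

    /* collective computation: sum all partial results into totalArea at rank 0 */
    MPI_Reduce(&localArea, &totalArea, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", totalArea);

    MPI_Finalize();
    return 0;
}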
Typical Problems Designing Debugging Scalability The first two are existing problems in computer science in general; the fact that you are dealing with a distributed environment merely makes them even bigger problems. Scalability is the new problem: since the programs must deal with communication overhead, it is usually difficult to reduce the computation time in a nearly-linear fashion as processors are added.
Scalability Analysis Definition: Estimation of the resource (computation and communication) requirements of a program as problem size and/or number of processors increases Requires knowledge of communication time Assumes otherwise idle nodes Ignores data requirements of each node When performing scalability analysis, we need knowledge of the propagation time of messages in order to make an estimate. Also, we assume that the nodes are not performing any other computation or communication; in other words, 100 percent of their resources are devoted to the task at hand. Lastly, we ignore the fact that as problem size increases, so does the likelihood of having to use virtual memory, which can have a profound effect on computation time.
Simple Scalability Example Tcomm = time to send a message Tcomm = s + rn s = start-up time (latency) r = time to send a single byte (i.e. 1/bandwidth) n = size of the message in bytes (e.g. the size of the data type: int, double, etc.)
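For instance (illustrative numbers, not from the original slides), with a start-up time of s = 100 microseconds and a bandwidth of 100 MB/s (r = 0.01 microseconds per byte), sending a single 8-byte double costs Tcomm = 100 + 0.01 x 8, roughly 100.08 microseconds, so the cost is dominated by start-up time; sending 1 MB costs roughly 100 + 0.01 x 10^6 = 10,100 microseconds, dominated by bandwidth.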
Simple Scalability Example Matrix multiplication of two square matrices of size n x n. The first matrix is broadcast to all nodes. Cost for the rest: Computation: n multiplications and (n - 1) additions per cell, i.e. n^2 x (2n - 1) = 2n^3 - n^2 floating point operations. Communication: send n elements to a worker node and return the resulting n elements to the master node (2n); doing this for each column of the result matrix gives n x 2n = 2n^2 elements transferred.
Simple Scalability Example Therefore, we get the following ratio of communication to computation: 2n^2 / (2n^3 - n^2) = 2 / (2n - 1). As n becomes very large, the ratio approaches 1/n, so this problem is not severely affected by communication overhead.
References
http://nf.apac.edu.au/training/MPIProg/mpi-slides/allslides.html
High Performance Linux Clusters. Joseph D. Sloan. O'Reilly Press.
Using MPI, Second Edition. Gropp, Lusk, and Skjellum. MIT Press.