Parallel Programming Styles and Hybrids
Objectives: discuss major classes of parallel programming models, hybrid programming, and examples.
Introduction. Parallel computer architectures have evolved, and so have the programming styles needed to use these architectures effectively. Two styles have become de facto standards: the MPI library for message passing and OpenMP compiler directives for multithreading. Both are widely used and are available on virtually every parallel system. On current parallel systems it often makes more sense to mix multithreading and message passing to maximize performance.
HPC Architectures (based on memory distribution). Shared – all processors have equal access to one or more banks of memory (Cray Y-MP, SGI Challenge, dual- and quad-processor workstations). Distributed – each processor has its own memory, which may or may not be visible to other processors (IBM SP2 and clusters of uniprocessor machines).
Distributed shared memory – NUMA (non-uniform memory access) machines such as the SGI Origin 3000 and HP Superdome, and clusters of SMPs (shared-memory systems) such as the IBM SP and Beowulf clusters.
Parallel Programming Styles. Explicit threading: not commonly used on distributed systems; uses locks, semaphores, and mutexes; synchronization and parallelization are handled by the programmer; POSIX threads (pthreads library). A minimal sketch follows.
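As an illustration of the explicit-threading style, here is a minimal pthreads sketch (the thread count, loop length, and shared counter are made up for illustration), showing the programmer-managed locking described above:

/* Minimal pthreads sketch: explicit threading with a mutex-protected
   shared counter.  Thread count and work are illustrative only. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* synchronization handled by the programmer */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}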
Message Passing Interface (MPI): the application consists of several processes that communicate by passing data to one another (send/receive, broadcast/gather). Synchronization is still required of the programmer; however, locking is not, since nothing is shared. A common approach is domain decomposition, where each task is assigned a subdomain and communicates its edge values to neighbouring subdomains. A minimal sketch of the basic communication primitives follows.
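A minimal sketch of the broadcast/gather primitives mentioned above (the parameter, the "partial result", and the variable names are hypothetical):

/* Minimal MPI sketch: rank 0 broadcasts a parameter, each rank computes
   a dummy partial result, and rank 0 gathers them. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double param = 0.0, partial, *results = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) param = 3.14;                          /* some input parameter */
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* broadcast to all ranks */

    partial = param * rank;                               /* dummy local work */

    if (rank == 0) results = malloc(size * sizeof(double));
    MPI_Gather(&partial, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);                        /* gather to rank 0 */

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("partial from rank %d = %f\n", i, results[i]);
        free(results);
    }
    MPI_Finalize();
    return 0;
}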
Compiler directives (OpenMP): special comments are added to a serial program in parallelizable regions, which requires a compiler that understands these directives. Locking and synchronization are handled by the compiler unless overridden by directives (implicit and explicit). Decomposition is done primarily by the programmer. Scalability is more limited than that of MPI applications because the programmer has less control over how the code is parallelized. A minimal sketch follows.
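A minimal sketch of the directive style: one comment-like pragma turns a serial loop into a parallel one (the loop body and array sizes are hypothetical):

/* Minimal OpenMP sketch: a serial loop parallelized by one directive. */
#include <stdio.h>

#define N 1000000
static double a[N], b[N];

int main(void)
{
    #pragma omp parallel for      /* the compiler splits iterations among threads */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[0] = %f\n", a[0]);
    return 0;
}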
Hybrid: a mixture of MPI and OpenMP, used on distributed shared-memory systems. Applications usually consist of computationally expensive loops punctuated by calls to MPI; in many cases these loops can be further parallelized by adding OpenMP directives. It is not a solution for all parallel programs, but it is quite suitable for certain algorithms.
Why Hybrid? Performance considerations: scalability. For a fixed problem size, hybrid code will scale to higher processor counts before being overwhelmed by communication overhead; a good example is the Laplace equation. It may not be effective where performance is limited by the speed of the interconnect rather than the processors.
Computer architecture: some architectural limitations force the use of hybrid computing (e.g. limits on the number of MPI processes per node or per cluster block). Some algorithms, notably FFTs, run better on machines where the local memory bandwidth is much greater than that of the network, due to the O(N²) growth of the bandwidth required. With a hybrid approach the number of MPI processes can be lowered while still using the same number of processors.
Algorithms: some algorithms, such as those used in computational fluid dynamics, benefit greatly from a hybrid approach. The solution space is separated into interconnected zones; the interaction between zones is handled by MPI, while the fine-grained computations required inside a zone are handled by OpenMP.
Considerations on MPI, OpenMP, and Hybrid Styles. General considerations: Amdahl's law. Amdahl's law states that the speedup through parallelization is limited by the portion of the serial code that cannot be parallelized. Similarly, in a hybrid program, if the fraction of the work within each MPI process that is parallelized with OpenMP is not high, the overall speedup is limited. See the formula below.
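In its standard form, with p the fraction of the runtime that can be parallelized and N the number of processors, the speedup is bounded by

S(N) = 1 / ((1 - p) + p/N)

For example, with p = 0.9 the speedup can never exceed 1/0.1 = 10, no matter how many processors are used.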
Communication patterns: how do the program's communication needs match the underlying hardware? A hybrid approach might increase performance where pure MPI code leads to rapid growth in communication traffic.
Machine balance: how do the memory, CPU, and interconnect affect the performance of a program? If the processors are fast, communication may become the bottleneck; caches may also behave differently across the machines in a cluster (e.g. older machines in a Beowulf cluster).
Memory access patterns: cache memory (primary, secondary, tertiary) has to be used effectively in order to achieve good performance on clusters.
Advantages and Disadvantages of OpenMP. Advantages: comparatively easy to implement; in particular, it is easy to retrofit an existing serial code for parallel execution. The same source code can be used for both the parallel and serial versions. More natural for shared-memory architectures. Dynamic scheduling (load balancing) is easier than with MPI. Useful for both fine- and coarse-grained problems.
Disadvantages: can only run on shared-memory systems, which limits the number of processors that can be used. Data placement and locality may become serious issues, especially on SGI NUMA architectures where the cost of remote memory access can be high. Thread-creation overhead can be significant unless enough work is performed in each parallel loop. Implementing coarse-grained solutions in OpenMP is usually about as involved as constructing the analogous MPI application, and explicit synchronization is required.
General characteristics: most effective for problems with fine-grained (i.e. loop-level) parallelism, though it can also be used for coarse-grained parallelism. Overall intra-node memory bandwidth may limit the number of processors that can be used effectively. Each thread sees the same global memory but has its own private memory. Implicit messaging; a higher level of abstraction than MPI.
Advantages and Disadvantages of MPI. Advantages: any parallel algorithm can be expressed in terms of the MPI paradigm. Runs on both distributed- and shared-memory systems, and performance is generally good in either environment. Allows explicit control over communication, leading to high efficiency by overlapping communication and computation. Allows for static task handling. Data placement problems are rarely observed. For suitable problems MPI scales well to very large numbers of processors. MPI is portable, and current implementations are efficient and optimized.
Disadvantages: application development is difficult; retrofitting existing serial code with MPI is often a major undertaking, requiring extensive restructuring of the serial code. It is less useful for fine-grained problems, where communication costs may dominate. For all-to-all type operations, the effective number of point-to-point interactions increases as the square of the number of processors, resulting in rapidly increasing communication costs. Dynamic load balancing is difficult to implement. Variations exist among different vendors' implementations of the MPI library: some may not implement all the calls, while others offer extensions.
General characteristics: MPI is most effective for problems with coarse-grained parallelism, for which the problem decomposes into quasi-independent pieces and communication needs are minimized.
The Best of Both Worlds. Use hybrid programming when: the code exhibits limited scaling with MPI; the code could make use of dynamic load balancing; the code exhibits fine-grained parallelism or a combination of both fine-grained and coarse-grained parallelism; the application makes use of replicated data.
Problems When Mixing Modes. Environment variables may not be passed correctly to the remote MPI processes. This has negative implications for hybrid jobs because each MPI process needs to read the MP_SET_NUMTHREADS environment variable in order to start the proper number of OpenMP threads. It can be solved by always setting the number of OpenMP threads within the code, as in the sketch below.
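One possible way to do this (the helper name and the thread count are illustrative; the count could come from a configuration file or a command-line argument rather than the environment):

/* Set the OpenMP thread count from inside the code instead of relying on
   an environment variable reaching every remote MPI process. */
#include <omp.h>

void configure_threads(int nthreads)   /* hypothetical helper */
{
    omp_set_num_threads(nthreads);     /* e.g. nthreads = 4 */
}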
Calling MPI communication functions within OpenMP parallel regions: hybrid programming works by spawning OpenMP threads from MPI processes, not the other way around; calling MPI from inside a parallel region will result in a runtime error. See the skeleton below.
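A minimal hybrid skeleton illustrating the allowed nesting: OpenMP threads are spawned inside each MPI process, and MPI communication is performed only outside the parallel region (the array, its size, and the reduction are made up for illustration):

/* Hybrid skeleton: OpenMP threads spawned from MPI processes;
   MPI communication happens outside the parallel region. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000
static double a[N];

int main(int argc, char *argv[])
{
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP parallelism inside each MPI process */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += a[i];

    /* MPI call made by the master thread only, outside the parallel region */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}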
Laplace Example. Outline: serial, MPI, OpenMP, hybrid.
Outline: the Laplace equation in two dimensions is also known as the potential equation and is usually one of the first PDEs (partial differential equations) encountered: ∂²u/∂x² + ∂²u/∂y² = 0. It is the governing equation for electrostatics, heat diffusion, and fluid flow. By adding a source function f(x,y) we get Poisson's equation; adding a first derivative in time gives the diffusion equation; adding a second derivative in time gives the wave equation. A numerical solution to this PDE can be computed using a finite-difference approach.
Using an iterative method to solve the equation, we get the following update:

du^(n+1)_(i,j) = ( u^n_(i-1,j) + u^n_(i+1,j) + u^n_(i,j-1) + u^n_(i,j+1) ) / 4 − u^n_(i,j)
u^(n+1)_(i,j) = u^n_(i,j) + du^(n+1)_(i,j)

Note: n represents the iteration number, not an exponent.
Serial – the cache-friendly approach incrementally computes the du values, compares them with the current maximum change, and then updates all u values. This can usually be done without any additional memory operations – good for clusters. See the code sketch below.
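A sketch of the serial Jacobi-style sweep implementing the update above (the grid size, tolerance, and array names are assumptions; boundary values are taken as fixed and initialization is elided):

/* Serial Laplace sketch: each sweep computes du, tracks the maximum
   change, then updates u; repeats until converged. */
#include <math.h>
#include <stdio.h>

#define NX 100
#define NY 100
#define TOL 1.0e-4

static double u[NX][NY], du[NX][NY];

int main(void)
{
    double dumax;
    /* ... initialize u, including fixed boundary values ... */
    do {
        dumax = 0.0;
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++) {
                du[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                   u[i][j-1] + u[i][j+1]) - u[i][j];
                if (fabs(du[i][j]) > dumax) dumax = fabs(du[i][j]);
            }
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                u[i][j] += du[i][j];
        printf("dumax = %e\n", dumax);
    } while (dumax > TOL);
    return 0;
}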
MPI – currently the most widely used method for distributed-memory systems. Note that processes are not the same as processors: an MPI process can be thought of like a thread, and multiple processes can run on a single processor; the system is responsible for mapping the MPI processes to physical processors. Each process is an exact copy of the program, with the exception that each copy has its own unique id.
Hello World (Fortran):

      PROGRAM hello
      INCLUDE 'mpif.h'
      INTEGER ierror, rank, size
      CALL MPI_INIT(ierror)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      IF (rank .EQ. 2) PRINT *, 'P:', rank, ' Hello World'
      PRINT *, 'I have rank ', rank, ' out of ', size
      CALL MPI_FINALIZE(ierror)
      END

Hello World (C):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 2) printf("P:%d Hello World\n", rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d out of %d.\n", rank, size);
    MPI_Finalize();
    return 0;
}
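For the Laplace example itself, a sketch of how the MPI version might be structured, assuming a one-dimensional decomposition by rows (the sizes, array names, and helper function are hypothetical; the local sweep is the same as the serial code and is elided):

/* MPI Laplace sketch: each rank owns a block of rows plus two halo rows;
   per iteration it exchanges halos, sweeps its block, and agrees on the
   global maximum change. */
#include <mpi.h>

#define NY 100
#define NROWS 25   /* local rows per rank, excluding halos */

static double u[NROWS + 2][NY], du[NROWS + 2][NY];

void jacobi_iteration(int rank, int size, double *dumax_global)
{
    double dumax_local = 0.0;

    /* exchange halo rows with the neighbouring ranks */
    if (rank > 0)
        MPI_Sendrecv(u[1], NY, MPI_DOUBLE, rank - 1, 0,
                     u[0], NY, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (rank < size - 1)
        MPI_Sendrecv(u[NROWS], NY, MPI_DOUBLE, rank + 1, 0,
                     u[NROWS + 1], NY, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* local sweep over rows 1..NROWS: same du/u update as the serial code,
       tracking dumax_local */
    /* ... */

    /* all ranks agree on the global maximum change for the convergence test */
    MPI_Allreduce(&dumax_local, dumax_global, 1, MPI_DOUBLE, MPI_MAX,
                  MPI_COMM_WORLD);
}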
OpenMP – OpenMP is a tool for writing multithreaded applications in a shared-memory environment. It consists of a set of compiler directives and library routines; the compiler generates multithreaded code based on the specified directives. OpenMP is essentially a standardization of the last 18 years or so of SMP (symmetric multiprocessor) development and practice. See the code sketch below.
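A sketch of the OpenMP version of the Laplace sweep: the same loops as the serial code, with the outer loops parallelized by directives (grid sizes and the function name are assumptions; the max reduction assumes a compiler supporting OpenMP 3.1 or later, otherwise a critical section would be needed):

/* OpenMP Laplace sketch: one sweep, returning the maximum change. */
#include <math.h>

#define NX 100
#define NY 100

static double u[NX][NY], du[NX][NY];

double laplace_sweep(void)
{
    double dumax = 0.0;

    #pragma omp parallel for reduction(max:dumax)
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++) {
            du[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                               u[i][j-1] + u[i][j+1]) - u[i][j];
            if (fabs(du[i][j]) > dumax) dumax = fabs(du[i][j]);
        }

    #pragma omp parallel for
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            u[i][j] += du[i][j];

    return dumax;
}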
Hybrid – remember you are running both MPI and OpenMP, so the program must be compiled with both enabled, e.g. "f90 -O3 -mp file.f90 -lmpi". See the code sketch below.
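A sketch of the hybrid version, combining the two previous sketches: MPI distributes the row blocks and exchanges halos, while OpenMP parallelizes the sweep within each process (sizes and names are the same assumptions as before; with a typical modern C toolchain this might be built with something like "mpicc -O3 -fopenmp laplace_hybrid.c"):

/* Hybrid Laplace sketch: MPI halo exchange by the master thread,
   OpenMP-parallel sweep inside each MPI process. */
#include <mpi.h>
#include <math.h>

#define NY 100
#define NROWS 25

static double u[NROWS + 2][NY], du[NROWS + 2][NY];

double hybrid_iteration(int rank, int size)
{
    double dumax_local = 0.0, dumax_global;

    /* MPI part: halo exchange (rows 0 and NROWS+1 are halos, or fixed
       boundaries on the first and last rank) */
    if (rank > 0)
        MPI_Sendrecv(u[1], NY, MPI_DOUBLE, rank - 1, 0,
                     u[0], NY, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (rank < size - 1)
        MPI_Sendrecv(u[NROWS], NY, MPI_DOUBLE, rank + 1, 0,
                     u[NROWS + 1], NY, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* OpenMP part: the computationally expensive loops */
    #pragma omp parallel for reduction(max:dumax_local)
    for (int i = 1; i <= NROWS; i++)
        for (int j = 1; j < NY - 1; j++) {
            du[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                               u[i][j-1] + u[i][j+1]) - u[i][j];
            if (fabs(du[i][j]) > dumax_local) dumax_local = fabs(du[i][j]);
        }

    #pragma omp parallel for
    for (int i = 1; i <= NROWS; i++)
        for (int j = 1; j < NY - 1; j++)
            u[i][j] += du[i][j];

    /* global convergence test, again outside any parallel region */
    MPI_Allreduce(&dumax_local, &dumax_global, 1, MPI_DOUBLE, MPI_MAX,
                  MPI_COMM_WORLD);
    return dumax_global;
}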