Hybrid Parallel Programming Introduction
Hybrid Systems
Since most computers are multi-core, most clusters have both shared memory and distributed memory.
[Figure: four multi-core computers, each with its own cores and memory, connected by an interconnection network]
Hybrid Parallel Computing
We can use MPI to run processes concurrently on each computer.
We can use OpenMP to run threads concurrently on each core of a computer.
Advantage: we can make use of shared memory where communication is required.
Why? Because inter-computer communication is an order of magnitude slower than synchronization through shared memory.
“More effective forms of parallelism need to be investigated and perhaps utilized within the node for certain applications to maximize efficiency.” [Holt:2011]
MPI implementation designers have recognized this, stating that MPI alone “does not make the most efficient use of the shared resources within the node of a HPC system.” [Dózsa:2010]
Research conducted at the University of Tokyo on three-level hybrid parallel programs running on a large SMP cluster was inconclusive about whether the perceived benefit justified the extra development effort. [Nakajima:2005]
Hybrid parallel programs using MPI and OpenMP have been developed mainly in the past decade, with mixed results when their performance is compared with MPI-only parallel versions. [Henty:2000], [Chow & Hysom:2001]
Matrix Multiplication, C = A * B where A is an n x l matrix and B is an l x m matrix.
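Written out element by element, the product defined above is the usual sum over the shared dimension l. The zero-based indices are an assumption made here to match the C code used later in the slides.

```latex
c_{i,j} \;=\; \sum_{k=0}^{l-1} a_{i,k}\, b_{k,j},
\qquad 0 \le i < n, \quad 0 \le j < m
```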
Matrix Multiplication Again
Each cell C[i][j] can be computed independently of the other elements of the C matrix.
If we are multiplying two matrices that are N x N, then we can use up to N^2 processors without any communication.
But the computation of C[i][j] is a dot product of two arrays. In other words, it is a reduction.
Matrix Multiplication Again
If we have more than N^2 processors, then we need to divide up the work of the reduction among processors.
The reduction requires communication.
Although we can do a reduction using MPI, the communication is much slower than doing a reduction in OpenMP.
However, N usually needs to be really big to justify parallel computation, and how likely are we to have N^3 processors available?
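The reduction described above can be sketched with OpenMP's reduction clause. This is a minimal illustration, not code from the slides: the helper name dot_row_col and the row-major layout with a column stride are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dot product of one row of A (contiguous, length n)
 * with one column of B, where consecutive column elements are b_stride
 * floats apart (row-major storage). */
float dot_row_col(const float *a_row, const float *b_col, int n, int b_stride)
{
    assert(a_row != NULL && b_col != NULL && n >= 0 && b_stride > 0);
    float sum = 0.0f;
    /* Each thread accumulates a private partial sum; OpenMP combines the
     * partial sums when the loop ends. Without -fopenmp the pragma is
     * ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++) {
        sum += a_row[k] * b_col[(size_t)k * (size_t)b_stride];
    }
    return sum;
}
```

For an N x N matrix stored row-major, element C[i][j] would be dot_row_col(&a[i*N], &b[j], N, N). All threads share the same memory, so no messages are exchanged to combine the partial sums.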
How is this possible?
OpenMP is supported by the icc and gcc compilers:
    gcc -fopenmp <file.c>
    icc -openmp <file.c>    (newer Intel compilers spell this flag -qopenmp)
MPI is a library that is linked with your C program:
    mpicc <file.c>
mpicc invokes the underlying compiler (gcc here) and links the appropriate MPI libraries.
How is this possible?
So to use both MPI and OpenMP:
    mpicc -fopenmp <file.c>
mpicc is simply a script, so it passes the -fopenmp flag on to gcc.
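One way to check that the OpenMP flag actually reached the compiler is the standard _OPENMP macro, which is defined only when OpenMP is enabled. The sketch below is illustrative; the function names openmp_version and describe_build are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the value of the standard _OPENMP macro (a yyyymm date) when the
 * file was compiled with OpenMP enabled, or 0 when it was not. */
long openmp_version(void)
{
#ifdef _OPENMP
    return (long)_OPENMP;
#else
    return 0L;
#endif
}

/* Hypothetical helper: writes a one-line build report into buf and returns
 * the value returned by snprintf. */
int describe_build(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    long v = openmp_version();
    if (v > 0)
        return snprintf(buf, len, "OpenMP enabled (_OPENMP=%ld)", v);
    return snprintf(buf, len, "OpenMP disabled; omp pragmas are ignored");
}
```

Compiled with plain mpicc the report says OpenMP is disabled; compiled with mpicc -fopenmp it reports the OpenMP version date.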
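Simple Example
    #include <math.h>
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int i, j, blksz, rank, NP, tid;
        char *usage = "Usage: %s \n";
        FILE *fd;
        char message[80];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &NP);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Rows per process, rounded up (N is the problem size, defined elsewhere) */
        blksz = (int) ceil(((double) N) / NP);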
Simple Example
    #pragma omp parallel private(tid, i, j)
    {
        tid = omp_get_thread_num();

        /* Loop i is parallelized across computers: each rank takes one block */
        for (i = rank * blksz; i < min((rank + 1) * blksz, N); i++) {

            /* Loop j is parallelized across threads */
            #pragma omp for
            for (j = 0; j < N; j++) {
                printf("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                       rank, tid, i, j);
            }
        }
    }
Simple Example Result
    rank 0, thread 4: executing loop iteration i=0 j=4
    rank 0, thread 2: executing loop iteration i=0 j=2
    rank 0, thread 1: executing loop iteration i=0 j=1
    rank 1, thread 0: executing loop iteration i=2 j=0
    rank 1, thread 4: executing loop iteration i=2 j=4
    rank 1, thread 2: executing loop iteration i=2 j=2
    rank 1, thread 3: executing loop iteration i=2 j=3
    rank 0, thread 0: executing loop iteration i=0 j=0
    rank 1, thread 1: executing loop iteration i=2 j=1
    rank 0, thread 3: executing loop iteration i=0 j=3
    rank 2, thread 2: executing loop iteration i=4 j=2
    rank 2, thread 0: executing loop iteration i=4 j=0
    rank 2, thread 3: executing loop iteration i=4 j=3
Simple Example Result (continued)
    rank 2, thread 4: executing loop iteration i=4 j=4
    rank 2, thread 1: executing loop iteration i=4 j=1
    rank 0, thread 2: executing loop iteration i=1 j=2
    rank 0, thread 4: executing loop iteration i=1 j=4
    rank 0, thread 3: executing loop iteration i=1 j=3
    rank 0, thread 0: executing loop iteration i=1 j=0
    rank 0, thread 1: executing loop iteration i=1 j=1
    rank 1, thread 0: executing loop iteration i=3 j=0
    rank 1, thread 2: executing loop iteration i=3 j=2
    rank 1, thread 3: executing loop iteration i=3 j=3
    rank 1, thread 1: executing loop iteration i=3 j=1
    rank 1, thread 4: executing loop iteration i=3 j=4
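The indices printed above are consistent with N = 5 rows split among three MPI ranks. That is an inference from the output, not something the slides state. Under that assumption, the block partitioning can be sketched as a small helper; the name block_range and the clamping of empty trailing blocks are illustrative choices, not part of the original program.

```c
#include <assert.h>

/* Hypothetical helper: half-open row range [lo, hi) owned by `rank` when
 * n rows are split into blocks of ceil(n / np) rows each. */
void block_range(int n, int np, int rank, int *lo, int *hi)
{
    assert(n >= 0 && np > 0 && rank >= 0 && rank < np);
    assert(lo != 0 && hi != 0);
    int blksz = (n + np - 1) / np;   /* integer form of ceil(n / np) */
    int start = rank * blksz;
    int end = start + blksz;
    if (start > n) start = n;        /* trailing ranks may own no rows */
    if (end > n) end = n;            /* the last block may be short */
    *lo = start;
    *hi = end;
}
```

With n = 5 and np = 3 the block size is 2, so rank 0 gets rows 0 to 1, rank 1 gets rows 2 to 3, and rank 2 gets only row 4, which matches the i values each rank prints.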
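Back to Matrix Multiplication
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &NP);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rows per process, rounded up */
    blksz = (int) ceil(((double) N) / NP);

    /* Give each process one block of blksz rows of a. MPI does not allow the
       send and receive buffers to overlap, so the root keeps its own block
       in place. */
    MPI_Scatter(a, N*blksz, MPI_FLOAT,
                rank == 0 ? MPI_IN_PLACE : (void *) a, N*blksz, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    /* Every process needs all of b */
    MPI_Bcast(b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);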
Back to Matrix Multiplication
    #pragma omp parallel private(tid, i, j, k)
    {
        /* Every thread runs the i loop; the j iterations are shared among threads */
        for (i = 0; i < blksz && rank * blksz < N; i++) {
            #pragma omp for nowait
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
Matrix Multiplication Results
    $ diff out MMULT.o5356
    1c1
    < elapsed_time= 1.525183 (seconds)
    ---
    > elapsed_time= 0.659652 (seconds)
    $ diff out MMULT.o5357
    > elapsed_time= 0.626821 (seconds)
    $
Sequential execution time: 1.525183 seconds
Hybrid execution time: 0.659652 seconds
MPI-only execution time: 0.626821 seconds
Hybrid did not do better than MPI only.
Back to Matrix Multiplication
    #pragma omp parallel private(tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz && rank * blksz < N; i++) {
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP.
But this loop is too complicated for OpenMP: the compound test (i < blksz && rank * blksz < N) is not in the canonical loop form that "omp for" requires.
Back to Matrix Multiplication
    #pragma omp parallel private(tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz; i++) {
            if (rank * blksz < N) {
                for (j = 0; j < N; j++) {
                    c[i][j] = 0.0;
                    for (k = 0; k < N; k++) {
                        c[i][j] += a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
Moving the rank test into an if statement simplifies the loop header, so OpenMP can parallelize the i loop.
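An alternative to the if statement is to compute the number of valid local rows before the loop, so the loop test stays a single comparison. The sketch below is a variant under stated assumptions, not the slides' code: the helpers local_rows and multiply_block are hypothetical, the matrices are flat row-major arrays, and a short last block is trimmed to the rows that exist (the slide's test only checks whether the block starts inside the matrix).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of valid rows in this rank's block
 * (0 if the block starts at or past row n). */
int local_rows(int n, int blksz, int rank)
{
    assert(n >= 0 && blksz >= 0 && rank >= 0);
    int remaining = n - rank * blksz;
    if (remaining < 0) remaining = 0;
    return remaining < blksz ? remaining : blksz;
}

/* Hypothetical helper: c = a * b for `rows` rows of a (rows x n, row-major)
 * and the full n x n matrix b. */
void multiply_block(int rows, int n, const float *a, const float *b, float *c)
{
    assert(rows >= 0 && n >= 0);
    /* The loop test is a plain `i < rows`, which is in canonical form, so
     * the i loop can be shared among threads directly. */
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

Each rank would call multiply_block(local_rows(N, blksz, rank), N, a, b, c) on its scattered block. Accumulating into a local sum also avoids repeated writes to c inside the k loop.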
Matrix Multiplication Results
    $ diff out MMULT.o5356
    1c1
    < elapsed_time= 1.525183 (seconds)
    ---
    > elapsed_time= 0.688119 (seconds)
Sequential execution time: 1.525183 seconds
Hybrid execution time: 0.688119 seconds
Still not better.
The Paraguin compiler can also create hybrid programs.
This work was done by graduate student L. Kyle Holt, 2011.
Because Paraguin uses mpicc, it will pass the OpenMP pragma through.
This works unless you are trying to parallelize the same loop using
Hybrid Matrix Multiplication using Paraguin
    ;
    #pragma paraguin begin_parallel

    #pragma paraguin forall C p i j k \
        0x0 -1  1 0x0 0x0 \
        0x0  1 -1 0x0 0x0

    #pragma paraguin bcast a b

    #pragma paraguin gather 1 C i j k \
        0x0 0x0 0x0  1 \
        0x0 0x0 0x0 -1
We are parallelizing the i loop.
Hybrid Matrix Multiplication using Paraguin
    #pragma omp parallel for private(__guin_p,i,j,k) schedule(static) num_threads(4)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
    ;
    #pragma paraguin end_parallel
Since the i loop is going to be parallelized using both MPI and OpenMP, the compiler has to modify the omp pragma.
Resulting Matrix Multiplication Parallel Program produced by Paraguin
    if (0 <= __guin_mypid & __guin_mypid <= -1 + 1 * __guin_NP) {
        #pragma omp parallel for private ( __guin_p , i , j , k ) schedule ( static ) num_threads ( 4 )
        for (__guin_p = __guin_blksz * __guin_mypid;
             __guin_p <= __suif_min(799, __guin_blksz + -1 + __guin_blksz * __guin_mypid);
             __guin_p++) {
            i = 1 * __guin_p;
            for (j = 0; j <= 799; j++) {
                c[i][j] = 0.0F;
                for (k = 0; k <= 799; k++) {
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }
Since the i loop has been replaced with a p loop, the pragma has been modified.
Matrix Multiplication (1000x1000 matrices)
[Figure: performance of the code generated by Paraguin. Annotation: unable to scale Sequential or OpenMP.]
Matrix Multiplication
[Figure: code generated by Paraguin. Labels: 4, 8, 12, 16, 20, 24, 28, and 32 PEs.]
Hybrid Sobel Edge Detection using Paraguin
    ;
    #pragma paraguin begin_parallel

    #pragma paraguin forall C p x y i j \
        0x0 -1  1 0x0 0x0 0x0 \
        0x0  1 -1 0x0 0x0 0x0

    #pragma paraguin bcast grayImage
    #pragma paraguin bcast w
    #pragma paraguin bcast h
We are parallelizing the x loop.
Hybrid Sobel Edge Detection using Paraguin
    #pragma omp parallel for private(__guin_p,x,y,i,j,sumx,sumy,sum) shared(w,h) num_threads(4)
    for (x = 0; x < N; ++x) {
        for (y = 0; y < N; ++y) {
            sumx = 0;
            sumy = 0;
            // handle image boundaries
            . . .
            edgeImage[x][y] = clamp(sum);
        }
    }
    ;
    #pragma paraguin end_parallel
Resulting Sobel Edge Detection Parallel Program produced by Paraguin
    …
    __guin_blksz = __suif_divceil(999 - 0 + 1, __guin_NP);
    if (0 <= __guin_mypid & __guin_mypid <= -1 + 1 * __guin_NP) {
        #pragma omp parallel for private ( __guin_p , x , y , i , j , sumx , sumy , sum ) shared ( w , h ) num_threads ( 4 )
        //Local variable produced by s2c
        __s2c_tmp = __suif_min(999, __guin_blksz + -1 + __guin_blksz * __guin_mypid);
        //omp pragma moved here
        for (__guin_p = __guin_blksz * __guin_mypid; __guin_p <= __s2c_tmp; __guin_p++)
We had to move the omp pragma by hand because of the temporary variable.
Sobel Edge Detection
[Figure: 2.5 MP grayscale 8-bit image; MPI-only and hybrid code generated by Paraguin]
We found that the majority of the time spent was in gathering the partial results.
If we remove the gather time from the execution time, we find…
Approx. workload time in Sobel Edge Detection of 2.5 MP image
Questions