1
Hybrid Parallel Programming
Introduction
2
Hybrid Systems
Since most computers are multi-core, most clusters have both shared-memory and distributed-memory.
[Diagram: several multi-core computers, each with its own cores and memory, connected by an interconnection network]
3
Hybrid Parallel Computing
We can use MPI to run processes concurrently on each computer.
We can use OpenMP to run threads concurrently on each core of a computer.
Advantage: we can make use of shared memory where communication is required.
Why? Because inter-computer communication is an order of magnitude slower than synchronization within a node.
4
“More effective forms of parallelism need to be investigated and perhaps utilized within the node for certain applications to maximize efficiency.” [Holt:2011]
MPI implementation designers have recognized this notion, stating that MPI alone “does not make the most efficient use of the shared resources within the node of a HPC system.” [Dózsa:2010]
5
Research conducted at the University of Tokyo on three-level hybrid parallel programs running on a large SMP cluster determined that it was inconclusive whether the extra development effort outweighed the perceived benefit [Nakajima:2005].
Hybrid parallel programs using MPI and OpenMP have been developed mainly in the past decade, with mixed results when comparing performance relative to MPI-only parallel versions [Henty:2000; Chow & Hysom:2001].
6
Matrix Multiplication, C = A * B
where A is an n x l matrix and B is an l x m matrix.
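As a baseline, a minimal sequential C sketch of this computation (assuming square N x N float arrays a, b, and c, and int loop indices i, j, k, as the later slides do):

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;                          /* each element of C starts at zero              */
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];       /* dot product of row i of A and column j of B   */
    }
}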
7
Matrix Multiplication Again
Each cell C[i][j] can be computed independently of the other elements of the C matrix.
If we are multiplying two matrices that are N x N, then we can use up to N² processors without any communication.
But the computation of C[i][j] is a dot product of two arrays. In other words, it is a reduction.
8
Matrix Multiplication Again
If we have more than N² processors, then we need to divide up the work of the reduction among processors.
The reduction requires communication.
Although we can do a reduction using MPI, the communication is much slower than doing a reduction in OpenMP.
However, N usually needs to be really big to justify parallel computation, and how likely are we to have N³ processors available!
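For reference, the dot-product reduction itself is easy to express in OpenMP; a minimal sketch (the arrays x and y and their length N are illustrative names, not taken from the slides):

float sum = 0.0f;
#pragma omp parallel for reduction(+:sum)       /* each thread accumulates a private partial sum */
for (int k = 0; k < N; k++)
    sum += x[k] * y[k];                         /* partial sums are combined when the loop ends  */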
9
How is this possible? OpenMP is supported by the icc and gcc compilers:

gcc -fopenmp <file.c>
icc -openmp <file.c>

MPI is a library that is linked with your C program:

mpicc <file.c>

mpicc uses gcc linked with the appropriate libraries.
10
How is this possible? So to use both MPI and OpenMP:
mpicc -fopenmp <file.c>

mpicc is simply a script.
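For example, a hybrid program might be built and run like this (the file name, process count, and thread count are illustrative, and the exact launcher options depend on your MPI installation):

mpicc -fopenmp hello_hybrid.c -o hello_hybrid    # link MPI libraries, enable OpenMP
export OMP_NUM_THREADS=4                         # threads per MPI process
mpirun -np 3 ./hello_hybrid                      # 3 MPI processes, 4 OpenMP threads each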
11
Simple Example

int main(int argc, char *argv[]) {
    int i, j, blksz, rank, NP, tid;
    char *usage = "Usage: %s \n";
    FILE *fd;
    char message[80];

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    blksz = (int) ceil (((double) N)/NP);
12
Simple Example

    #pragma omp parallel private (tid, i, j)
    {
        tid = omp_get_thread_num();
        for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {   /* Loop i parallelized across computers */
            #pragma omp for
            for (j = 0; j < N; j++) {                                 /* Loop j parallelized across threads   */
                printf ("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                        rank, tid, i, j);
            }
        }
    }
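Put together, a self-contained version of this example might look like the following sketch (the value of N, the min macro, and the header list are assumptions added here so that it compiles; they are not shown on the slides):

#include <stdio.h>
#include <math.h>
#include <mpi.h>
#include <omp.h>

#define N 6                                    /* assumed problem size for the demo */
#define min(a,b) ((a) < (b) ? (a) : (b))       /* min() is not standard C           */

int main(int argc, char *argv[]) {
    int i, j, blksz, rank, NP, tid;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    blksz = (int) ceil (((double) N)/NP);

    #pragma omp parallel private (tid, i, j)
    {
        tid = omp_get_thread_num();
        for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {
            #pragma omp for
            for (j = 0; j < N; j++)
                printf ("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                        rank, tid, i, j);
        }
    }

    MPI_Finalize ();
    return 0;
}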
13
Simple Example Result

rank 0, thread 4: executing loop iteration i=0 j=4
rank 0, thread 2: executing loop iteration i=0 j=2
rank 0, thread 1: executing loop iteration i=0 j=1
rank 1, thread 0: executing loop iteration i=2 j=0
rank 1, thread 4: executing loop iteration i=2 j=4
rank 1, thread 2: executing loop iteration i=2 j=2
rank 1, thread 3: executing loop iteration i=2 j=3
rank 0, thread 0: executing loop iteration i=0 j=0
rank 1, thread 1: executing loop iteration i=2 j=1
rank 0, thread 3: executing loop iteration i=0 j=3
rank 2, thread 2: executing loop iteration i=4 j=2
rank 2, thread 0: executing loop iteration i=4 j=0
rank 2, thread 3: executing loop iteration i=4 j=3
14
Simple Example Result

rank 2, thread 4: executing loop iteration i=4 j=4
rank 2, thread 1: executing loop iteration i=4 j=1
rank 0, thread 2: executing loop iteration i=1 j=2
rank 0, thread 4: executing loop iteration i=1 j=4
rank 0, thread 3: executing loop iteration i=1 j=3
rank 0, thread 0: executing loop iteration i=1 j=0
rank 0, thread 1: executing loop iteration i=1 j=1
rank 1, thread 0: executing loop iteration i=3 j=0
rank 1, thread 2: executing loop iteration i=3 j=2
rank 1, thread 3: executing loop iteration i=3 j=3
rank 1, thread 1: executing loop iteration i=3 j=1
rank 1, thread 4: executing loop iteration i=3 j=4
15
Back to Matrix Multiplication
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &NP);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
blksz = (int) ceil (((double) N)/NP);

MPI_Scatter (a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Bcast (b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
16
Back to Matrix Multiplication
#pragma omp parallel private (tid, i, j, k)
{
    for (i = 0; i < blksz && rank * blksz < N; i++) {
        #pragma omp for nowait
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
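The slides do not show how the partial rows of C are collected back onto rank 0; a minimal sketch, assuming c is declared as float c[N][N] on every rank and the same blksz-row decomposition is used:

/* Collect blksz rows of C from every rank onto rank 0.
   MPI_IN_PLACE keeps the root from aliasing its send and receive buffers. */
if (rank == 0)
    MPI_Gather (MPI_IN_PLACE, N*blksz, MPI_FLOAT,
                c, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
else
    MPI_Gather (c, N*blksz, MPI_FLOAT,
                NULL, 0, MPI_FLOAT, 0, MPI_COMM_WORLD);

MPI_Finalize ();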
17
Matrix Multiplication Results
$ diff out MMULT.o5356
1c1
< elapsed_time= (seconds)      <-- Sequential Execution Time
---
> elapsed_time= (seconds)      <-- Hybrid Execution Time
$ diff out MMULT.o5357
> elapsed_time= (seconds)      <-- MPI-only Execution Time
$

Hybrid did not do better than MPI-only.
18
Back to Matrix Multiplication
#pragma omp parallel private (tid, i, j, k)
{
    #pragma omp for nowait
    for (i = 0; i < blksz && rank * blksz < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP.
But this loop is too complicated for OpenMP.
19
Back to Matrix Multiplication
#pragma omp parallel private (tid, i, j, k)
{
    #pragma omp for nowait
    for (i = 0; i < blksz; i++) {
        if (rank * blksz < N) {
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
}

An if statement can simplify the loop.
20
Matrix Multiplication Results
$ diff out MMULT.o5356
1c1
< elapsed_time= (seconds)      <-- Sequential Execution Time
---
> elapsed_time= (seconds)      <-- Hybrid Execution Time

Still not better.
21
The Paraguin compiler can also create hybrid programs
This work was done by graduate student L. Kyle Holt, 2011.
Because Paraguin uses mpicc, it passes the OpenMP pragmas through to the back-end compiler.
This works unless you are trying to parallelize the same loop using both MPI and OpenMP.
22
Hybrid Matrix Multiplication using Paraguin
;
#pragma paraguin begin_parallel
#pragma paraguin forall C p i j k \
                 0x x0 0x0 \
                 0x x0 0x0
#pragma paraguin bcast a b
#pragma paraguin gather 1 C i j k \
                 0x0 0x0 0x0  1 \
                 0x0 0x0 0x0 -1

We are parallelizing the i loop.
23
Hybrid Matrix Multiplication using Paraguin
#pragma omp parallel for private(__guin_p,i,j,k) schedule(static) num_threads(4)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
;
#pragma paraguin end_parallel

Since the i loop is going to be parallelized using both MPI and OpenMP, the compiler has to modify the omp pragma.
24
Resulting Matrix Multiplication Parallel Program produced by Paraguin
if (0 <= __guin_mypid & __guin_mypid <= * __guin_NP) {
    #pragma omp parallel for private ( __guin_p , i , j , k ) schedule ( static ) num_threads ( 4 )
    for (__guin_p = __guin_blksz * __guin_mypid; __guin_p <= __suif_min(799, __guin_blksz __guin_blksz * __guin_mypid); __guin_p++) {
        i = 1 * __guin_p;
        for (j = 0; j <= 799; j++) {
            c[i][j] = 0.0F;
            for (k = 0; k <= 799; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}

Since the i loop has been replaced with a p loop, the pragma has been modified.
25
Matrix Multiplication (1000x1000 matrices) Code generated by Paraguin
[Chart of execution times. Callout: the sequential and OpenMP-only versions are unable to scale.]
26
Matrix Multiplication Code generated by Paraguin
[Chart: execution times for 4, 8, 12, 16, 20, 24, 28, and 32 PEs.]
27
Hybrid Sobel Edge Detection using Paraguin
;
#pragma paraguin begin_parallel
#pragma paraguin forall C p x y i j \
                 0x x0 0x0 0x0 \
                 0x x0 0x0 0x0
#pragma paraguin bcast grayImage
#pragma paraguin bcast w
#pragma paraguin bcast h

We are parallelizing the x loop.
28
Hybrid Sobel Edge Detection using Paraguin
#pragma omp parallel for private(__guin_p,x,y,i,j,sumx,sumy,sum) shared(w,h) num_threads(4)
for (x = 0; x < N; ++x) {
    for (y = 0; y < N; ++y) {
        sumx = 0;
        sumy = 0;
        // handle image boundaries
        edgeImage[x][y] = clamp(sum);
    }
}
;
#pragma paraguin end_parallel
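The body elided by the "handle image boundaries" comment is the 3x3 Sobel convolution itself; a minimal sketch of what it typically looks like (the GX/GY masks and the border handling below are illustrative, not taken from the slides):

/* 3x3 Sobel masks for the horizontal and vertical gradients */
static const int GX[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
static const int GY[3][3] = { { 1, 2, 1}, { 0, 0, 0}, {-1,-2,-1} };

if (x == 0 || x == h-1 || y == 0 || y == w-1) {
    sum = 0;                                   /* skip the image border                 */
} else {
    for (i = -1; i <= 1; i++)
        for (j = -1; j <= 1; j++) {
            sumx += grayImage[x+i][y+j] * GX[i+1][j+1];
            sumy += grayImage[x+i][y+j] * GY[i+1][j+1];
        }
    sum = abs(sumx) + abs(sumy);               /* approximate gradient magnitude        */
}
edgeImage[x][y] = clamp(sum);                  /* clamp to 0..255, as in the slide code */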
29
Resulting Sobel Edge Detection Parallel Program produced by Paraguin
…
__guin_blksz = __suif_divceil( , __guin_NP);
if (0 <= __guin_mypid & __guin_mypid <= * __guin_NP) {
    #pragma omp parallel for private ( __guin_p , x , y , i , j , sumx , sumy , sum ) shared ( w , h ) num_threads ( 4 )
    // Local variable produced by s2c
    __s2c_tmp = __suif_min(999, __guin_blksz __guin_blksz * __guin_mypid);
    // omp pragma moved here
    for (__guin_p = __guin_blksz * __guin_mypid; __guin_p <= __s2c_tmp; __guin_p++)

We had to move the omp pragma by hand because of the temporary variable.
30
Sobel Edge Detection (2.5 MP grayscale 8-bit image), MPI-only vs. Hybrid, code generated by Paraguin
31
We found that the majority of the time was spent gathering the partial results.
If we remove the gather time from the execution time, we find…
32
Approx. workload time in Sobel Edge Detection of 2.5 MP image
33
Questions