Hybrid Parallel Programming

Hybrid Parallel Programming
Introduction

Interconnection Network
Hybrid Systems Since most computers are multi-core, most clusters have both shared-memory and distributed-memory. Interconnection Network Core Memory Multi-core Computer Core Memory Multi-core Computer Core Memory Multi-core Computer Core Memory Multi-core Computer

Hybrid Parallel Computing
We can use MPI to run processes concurrently on each computer We can use OpenMP to run threads concurrently on each core of a computer Advantage: we can make use of shared- memory where communication is required Why? – Because inter-computer communication is an order of magnitude slower than synchronization

“More effective forms of parallelism need to be investigated and perhaps utilized within the node for certain applications to maximize efficiency.” [Holt:2011] MPI implementation designers have recognized this notion stating that MPI alone “does not make the most efficient use of the shared resources within the node of a HPC system.” [Dózsa:2010]

Research conducted at the University of Tokyo on three-level hybrid parallel programs running on a large SMP cluster determined that it was inconclusive whether the extra development effort outweighed the perceived benefit [Nakajima:2005] Hybrid parallel programs using MPI and OpenMP have been developed mainly in the past decade with mixed results comparing performance relative to MPI-only parallel versions.[Henty:2000], Chow & Hysom:2001]

Matrix Multiplication, C = A * B
where A is an n x l matrix and B is an l x m matrix.

Matrix Multiplication Again
Each cell Ci,j can be computed independently of the other elements of the C matrix If we are multiplying to matrices that are NxN, then we can use up to N2 processors without any communication But the computation of Ci,j is a dot-product of two arrays. In other words, it is a reduction

Matrix Multiplication Again
If we have more than N2 processors, then we need to divide up the work of the reduction among processors The reduction requires communication Although we can do a reduction using MPI, the communication is much slower than doing a reduction in OpenMP However, N usually needs to be really big to justify parallel computation, and how likely are we to have N3 processors available!

How is this possible? OpenMP is supported by icc and gcc compilers:
gcc –fopenmp <file.c> icc –openmp <file.c> MPI is a library that is linked with your C program: mpicc <file.c> mpicc uses gcc linked with the appropriate libraries

How is this possible? So to use both MPI and OpenMP:
mpicc –fopenmp <file.c> mpicc is simply a script

Simple Example int main(int argc, char *argv[]) { int i, j, blksz, rank, NP, tid; char *usage = "Usage: %s \n"; FILE *fd; char message[80]; MPI_Init (&argc, &argv); MPI_Comm_size (MPI_COMM_WORLD, &NP); MPI_Comm_rank (MPI_COMM_WORLD, &rank); blksz = (int) ceil (((double) N)/NP);

Simple Example Loop i parallelized across computers
#pragma omp parallel private (tid, i, j) { tid = omp_get_thread_num(); for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) { #pragma omp for for (j = 0; j < N; j++) { printf ("rank %d, thread %d: executing loop iteration i=%d j=%d\n", rank, tid, i, j); } Loop i parallelized across computers Loop j parallelized across threads

Simple Example Result rank 0, thread 4: executing loop iteration i=0 j=4 rank 0, thread 2: executing loop iteration i=0 j=2 rank 0, thread 1: executing loop iteration i=0 j=1 rank 1, thread 0: executing loop iteration i=2 j=0 rank 1, thread 4: executing loop iteration i=2 j=4 rank 1, thread 2: executing loop iteration i=2 j=2 rank 1, thread 3: executing loop iteration i=2 j=3 rank 0, thread 0: executing loop iteration i=0 j=0 rank 1, thread 1: executing loop iteration i=2 j=1 rank 0, thread 3: executing loop iteration i=0 j=3 rank 2, thread 2: executing loop iteration i=4 j=2 rank 2, thread 0: executing loop iteration i=4 j=0 rank 2, thread 3: executing loop iteration i=4 j=3

Simple Example Result rank 2, thread 4: executing loop iteration i=4 j=4 rank 2, thread 1: executing loop iteration i=4 j=1 rank 0, thread 2: executing loop iteration i=1 j=2 rank 0, thread 4: executing loop iteration i=1 j=4 rank 0, thread 3: executing loop iteration i=1 j=3 rank 0, thread 0: executing loop iteration i=1 j=0 rank 0, thread 1: executing loop iteration i=1 j=1 rank 1, thread 0: executing loop iteration i=3 j=0 rank 1, thread 2: executing loop iteration i=3 j=2 rank 1, thread 3: executing loop iteration i=3 j=3 rank 1, thread 1: executing loop iteration i=3 j=1 rank 1, thread 4: executing loop iteration i=3 j=4

Back to Matrix Multiplication
MPI_Init (&argc, &argv); MPI_Comm_size (MPI_COMM_WORLD, &NP); MPI_Comm_rank (MPI_COMM_WORLD, &rank); blksz = (int) ceil (((double) N)/NP); MPI_Scatter (a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD); MPI_Bcast (b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);

#pragma omp parallel private (tid, i, j, k) { for (i = 0; i < blksz && rank * blksz < N; i++) { #pragma omp for nowait for (j = 0; j < N; j++) { c[i][j] = 0.0; for (k = 0; k < N; k++) { c[i][j] += a[i][k] * b[k][j]; }

Matrix Multiplication Results
$ diff out MMULT.o5356 1c1 < elapsed_time= (seconds) --- >elapsed_time= (seconds) $ diff out MMULT.o5357 > elapsed_time= (seconds) $ Sequential Execution Time Hybrid Execution Time MPI-only Execution Time Hybrid did not do better than MPI only

#pragma omp parallel private (tid, i, j, k) { #pragma omp for nowait for (i = 0; i < blksz && rank * blksz < N; i++) { for (j = 0; j < N; j++) { c[i][j] = 0.0; for (k = 0; k < N; k++) { c[i][j] += a[i][k] * b[k][j]; } Perhaps we could do better parallizing the i loop both with MPI and OpenMP But this loop is too complicated for OpenMP

#pragma omp parallel private (tid, i, j,k) { #pragma omp for nowait for (i = 0; i < blksz; i++) { if (rank * blksz < N) { for (j = 0; j < N; j++) { c[i][j] = 0.0; for (k = 0; k < N; k++) { c[i][j] += a[i][k] * b[k][j]; } An if statement can simplify the loop

Matrix Multiplication Results
$ diff out MMULT.o5356 1c1 < elapsed_time= (seconds) --- >elapsed_time= (seconds) Sequential Execution Time Hybrid Execution Time Still not better

Discussion Point Why does the hybrid approach not outperform MPI-only for this problem? For what kinds of problem might a hybrid approach do better?

Questions

Hybrid Parallel Programming

Similar presentations

Presentation on theme: "Hybrid Parallel Programming"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Hybrid Parallel Programming

Similar presentations

Presentation on theme: "Hybrid Parallel Programming"— Presentation transcript:

Similar presentations

About project

Feedback