
1 ITCS4145 Parallel Programming, B. Wilkinson, March 23, 2016, hybrid-abw.ppt. Hybrid Parallel Programming: Introduction

2 Hybrid Systems
Since most computers are multi-core, most clusters have both shared memory (within each multi-core computer) and distributed memory (across computers).
[Slide diagram: several multi-core computers, each with cores sharing a local memory, connected by an Ethernet switch]

3 Hybrid (MPI-OpenMP) Parallel Computing
MPI to run processes concurrently on each computer.
OpenMP to run threads concurrently on each core of a computer.
Advantage: we can make use of shared memory where communication is required.
Why? Because inter-computer communication is an order of magnitude slower than synchronization.

4 Message-passing routines are used to pass messages between computer systems, and threads execute on each computer system using the multiple cores on the system.

5 How to create a hybrid OpenMP-MPI program
Write source code with both MPI routines and OpenMP directives/routines.
Compile using mpicc:
– mpicc uses gcc linked with the appropriate MPI libraries.
– gcc supports OpenMP with the -fopenmp option.
So can use: mpicc -fopenmp -o hybrid hybrid.c
Execute as an MPI program, e.g. on the UNCC cluster cci-gridgw.uncc.edu:
mpiexec.hydra -f <machines file> -n <number of processes> ./hybrid
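As a concrete illustration of this workflow (not from the slides), a minimal hybrid program might look like the sketch below; the file name hybrid.c and the mpiexec flags are assumptions and vary by installation.

/* hybrid.c : minimal hybrid MPI + OpenMP sketch (illustrative only).
   Compile: mpicc -fopenmp -o hybrid hybrid.c
   Run (flags vary by MPI installation): mpiexec -n 4 ./hybrid */
#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char **argv) {
   int rank, size;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   #pragma omp parallel                 // each MPI process starts a team of OpenMP threads
   printf("Process %d of %d: thread %d of %d\n",
          rank, size, omp_get_thread_num(), omp_get_num_threads());
   MPI_Finalize();
   return 0;
}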

6 Example

#include <stdio.h>       // original #include targets lost in transcription; stdio.h, string.h, omp.h assumed
#include <string.h>
#include <omp.h>
#include "mpi.h"
#define N 10

void openmp_code(int rank) {
   …  // next slide
}

int main(int argc, char **argv) {
   char message[20];
   int i, rank, size, type = 99;
   MPI_Status status;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 0) {
      strcpy(message, "Hello, world");
      for (i = 1; i < size; i++)
         MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
   } else
      MPI_Recv(message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);
   openmp_code(rank);    // all MPI processes run the OpenMP code; no message passing here
   printf("Message from process = %d : %.13s\n", rank, message);
   MPI_Finalize();
   return 0;
}

7

void openmp_code(int rank) {
   int nthreads, i, t;
   double a[N], b[N], c[N];
   for (i = 0; i < N; i++)
      a[i] = b[i] = i * 1.0;            // initialize arrays
   t = 8;
   omp_set_num_threads(t);              // set # of threads for each MPI process
   printf("MPI process %d, number of threads = %d\n", rank, t);
   #pragma omp parallel for shared(a,b,c)
   for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
      printf("Process %d: Thread %d: c[%d] = %f\n", rank, omp_get_thread_num(), i, c[i]);
   }
   return;
}
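One detail the slides do not cover: when MPI and OpenMP are mixed, MPI_Init_thread can be used in place of MPI_Init to request a specific level of thread support. A minimal sketch, assuming only the master thread makes MPI calls (MPI_THREAD_FUNNELED):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
   int provided;
   // Request FUNNELED support: only the thread that called MPI_Init_thread makes MPI calls.
   MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
   if (provided < MPI_THREAD_FUNNELED) {
      fprintf(stderr, "MPI library does not provide the requested thread support\n");
      MPI_Abort(MPI_COMM_WORLD, 1);
   }
   // ... MPI + OpenMP work as in the example above ...
   MPI_Finalize();
   return 0;
}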

8 [slide content not captured in this transcript]

9 Parallelizing a double for loop

#include <stdio.h>       // original slide shows "#include …"; stdio.h, omp.h, mpi.h assumed
#include <omp.h>
#include "mpi.h"
#define N 4

int main(int argc, char *argv[]) {
   int i, j, blksz, rank, P, start, end;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &P);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   blksz = N/P;
   if (N % P != 0)
      printf("ERROR: N must be a multiple of P. N = 4\n");
   start = rank*blksz;
   end = start + blksz;
   for (i = start; i < end; i++) {      // loop i parallelized across MPI processes
      #pragma omp parallel for
      for (j = 0; j < N; j++) {         // loop j parallelized across OpenMP threads
         printf("Process rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                rank, omp_get_thread_num(), i, j);
      }
   }
   MPI_Finalize();
   return 0;
}
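The code above simply prints an error when N is not a multiple of P. A common way to handle an uneven split (a sketch, not part of the slides) is to give the first N % P ranks one extra iteration:

// Illustrative block distribution of N iterations over P ranks when N % P != 0.
int blksz = N / P;
int rem   = N % P;
int start = rank * blksz + (rank < rem ? rank : rem);   // ranks 0..rem-1 each start one slot later
int end   = start + blksz + (rank < rem ? 1 : 0);       // ranks 0..rem-1 get one extra iteration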

10 Sample output (screenshot not reproduced in this transcript)

11 Hybrid (MPI-OpenMP) Parallel Computing
Caution: using the hybrid approach does not necessarily result in increased performance; the benefit depends strongly on the application.

12 Matrix Multiplication, C = A * B, where A is an n x l matrix and B is an l x m matrix.
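Element-wise, each entry of C is the dot product of a row of A and a column of B:

c_{ij} = \sum_{k=1}^{l} a_{ik}\, b_{kj}, \qquad 1 \le i \le n, \; 1 \le j \le m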

13 One way to parallelize matrix multiplication using the hybrid approach

for (i = 0; i < N; i++)
   for (j = 0; j < N; j++) {
      c[i][j] = 0.0;
      for (k = 0; k < N; k++) {
         c[i][j] += a[i][k] * b[k][j];
      }
   }

Parallelize the i loop: partitioned among the computers with MPI.
Parallelize the j loop: partitioned among the cores within each computer, using OpenMP.

14 MPI-OpenMP Matrix Multiplication

MPI_Scatter(A, blksz*N, … );    // scatter input matrix A
MPI_Bcast(B, N*N, … );          // broadcast input matrix B
for (i = 0; i < blksz; i++) {             // i loop partitioned among processes with MPI
   #pragma omp parallel for private(sum, k)
   for (j = 0; j < N; j++) {              // j loop partitioned on each computer using OpenMP
      sum = 0;
      for (k = 0; k < N; k++) {
         sum += A[i][k] * B[k][j];
      }
      C[i][j] = sum;
   }
}
MPI_Gather(C, blksz*N, … );

Simply add the one #pragma statement to the MPI code for matrix multiplication.
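For reference, a self-contained sketch of the same scheme with the elided arguments filled in; MPI_DOUBLE as the datatype, rank 0 as root, square N x N matrices, and the test values are assumptions for illustration (the slide itself leaves these as "…"):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define N 256                                   /* problem size mentioned on the Discussion slide */

static double A[N][N], B[N][N], C[N][N];        /* full matrices; A and C only meaningful on rank 0 */

int main(int argc, char **argv) {
   int rank, P, i, j, k, blksz;
   double sum, *a, *c;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &P);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   blksz = N / P;                               /* assumes N is a multiple of P, as on the slide */

   a = malloc((size_t)blksz * N * sizeof *a);   /* this process's block of rows of A */
   c = malloc((size_t)blksz * N * sizeof *c);   /* this process's block of rows of C */

   if (rank == 0)                               /* root fills the inputs with arbitrary test values */
      for (i = 0; i < N; i++)
         for (j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = i - j; }

   MPI_Scatter(A, blksz*N, MPI_DOUBLE, a, blksz*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   MPI_Bcast(B, N*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

   for (i = 0; i < blksz; i++) {                /* i loop: partitioned across MPI processes */
      #pragma omp parallel for private(sum, k)
      for (j = 0; j < N; j++) {                 /* j loop: partitioned across OpenMP threads */
         sum = 0.0;
         for (k = 0; k < N; k++)
            sum += a[i*N + k] * B[k][j];
         c[i*N + j] = sum;
      }
   }

   MPI_Gather(c, blksz*N, MPI_DOUBLE, C, blksz*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

   if (rank == 0)
      printf("C[0][0] = %f, C[N-1][N-1] = %f\n", C[0][0], C[N-1][N-1]);
   free(a); free(c);
   MPI_Finalize();
   return 0;
}

This sketch is compiled and run the same way as the earlier example (mpicc -fopenmp …, then mpiexec).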

15 [slide content not captured in this transcript]

16 Hybrid did not do better than MPI only.

17 Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP.
Parallelize the i loop into partitions among processes/threads with MPI and OpenMP; the j loop is not parallelized (see the sketch below).
No better!
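A sketch of that variant (reconstructed from the slide's description; variable names follow the earlier matrix-multiplication sketch): the i loop is split first across MPI processes (blksz rows each) and then across the OpenMP threads within a process, while the j and k loops stay serial.

   #pragma omp parallel for private(j, k, sum)   /* threads split this process's block of rows */
   for (i = 0; i < blksz; i++) {
      for (j = 0; j < N; j++) {
         sum = 0.0;
         for (k = 0; k < N; k++)
            sum += a[i*N + k] * B[k][j];
         c[i*N + j] = sum;
      }
   }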

18 Discussion
Although the demos are done on a single 4-core machine*, experiments on a cluster do not show improvements either.
Why does the hybrid approach not outperform MPI-only for this problem?
For what kinds of problems might a hybrid approach do better?
Note that the problem size is small: 256 x 256 arrays.
*Intel i7-3770, 3.4 GHz (4-core, hyperthreaded) with 16 GB main memory

19 Questions

