OMP2MPI A. Saà-Garriga et al., CAIAC (UAB)HIP3ES 1/28
OMP2MPI: Automatic MPI code generation from OpenMP programs
A. Saà-Garriga, D. Castells-Rufas and J. Carrabina
{Albert.saa, David.castells,
Microelectronic and Electronic Systems Department, Universitat Autònoma de Barcelona (UAB)
21/01/2015

Outline
1 Introduction
2 OMP2MPI Compiler
3 Results
4 Conclusions

OpenMP and MPI
De facto standards for programming High Performance Computing (HPC) applications.
MPI
- Usually associated with large distributed-memory systems
- ...but implementations take advantage of shared memory inside nodes
- ...and it can also be used for distributed-memory many-cores
- Very intrusive: MPI calls are interleaved with the sequential code
OpenMP
- Simple, easy to learn
- The programmer is exposed to a shared-memory model
- ...usually so, but there are several options to extend it to different architectures
- Harder to scale up efficiently

Goal
- Generate MPI from OpenMP
- Go beyond the limits of shared memory (with MPI) while starting from easy OpenMP source code
- Use it on large supercomputers and on distributed-memory embedded systems (STORM)
Pictured systems:
- Tianhe-2 supercomputer (DM): current #1 in Top500
- Bull Bullion node (SM): 160 cores
- UAB's FPGA-based MPSoC (DM) / ocMPI: 16 cores

OMP2MPI
- Source-to-source compiler
- Based on the Mercurium (BSC) compilation framework

Input Source Code
- We focus on #pragma omp parallel for loops, including reduction operations
- We transform them into an MPI application using a reduced MPI subset (MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize)
- We support loops with bounded limits and constantly spaced iterations
- Private variables are correctly handled by design
- Shared variables are maintained by the master node

    void main() {
        ...
        #pragma omp parallel for target mpi
        for (int i = 0; i < N; ++i) {
            double x = (i + 0.5) * step;
            sum[i] = 4.0 / (1.0 + x * x);
        }
        #pragma omp parallel for reduction(+:total) target mpi
        for (int j = 0; j < N; ++j) {
            total += sum[j];
        }
        ...
    }

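For reference, the two annotated loops compute a midpoint-rule approximation of pi. The following is a minimal sequential sketch of that computation. It assumes `step = 1.0 / N` and that the final result is `step * total`; neither is shown on the slide, and `approx_pi` is a hypothetical name introduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Sequential reference for the two annotated loops.
 * Assumptions (not on the slide): step = 1.0 / n and result = step * total.
 * approx_pi is a hypothetical helper; it returns -1.0 if allocation fails. */
double approx_pi(int n)
{
    assert(n > 0);
    double *sum = malloc((size_t)n * sizeof *sum);
    if (sum == NULL) return -1.0;

    const double step = 1.0 / n;
    double total = 0.0;

    /* First loop: independent iterations, each one writes only sum[i]. */
    for (int i = 0; i < n; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }

    /* Second loop: a '+' reduction over sum[] into total. */
    for (int j = 0; j < n; ++j) {
        total += sum[j];
    }

    free(sum);
    return step * total;
}
```

The first loop is the kind whose output array must be gathered by the master, while the second needs only one scalar per slave.
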
Generate MPI Source Code
- The main idea is to divide each OpenMP block into a master task and a set of slave tasks
- MPI applications must be initialized and finalized
- Rank 0 contains all the sequential code from the original OpenMP application

Shared Variable Analysis
- For each shared variable used inside an OpenMP block to be transformed, OMP2MPI analyzes the Abstract Syntax Tree to identify whether and when it is accessed
- Depending on that information, MPI_Send / MPI_Recv instructions are inserted to transfer the data to the appropriate slaves

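One way to picture the outcome of this analysis is as a transfer direction per shared variable. The sketch below is an assumption about how read/write information could map onto the IN, OUT and INOUT labels that appear in the task-division diagrams. The type and function names are hypothetical and are not taken from OMP2MPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical transfer classes, named after the IN/OUT/INOUT labels. */
typedef enum { XFER_NONE, XFER_IN, XFER_OUT, XFER_INOUT } xfer_t;

/* Assumed mapping: a variable read inside the block must reach the slaves
 * (IN), a variable written inside the block must come back to the master
 * (OUT), and a variable that is both read and written needs both (INOUT). */
xfer_t classify_shared(bool read_in_block, bool written_in_block)
{
    if (read_in_block && written_in_block) return XFER_INOUT;
    if (read_in_block) return XFER_IN;
    if (written_in_block) return XFER_OUT;
    return XFER_NONE;
}

/* True when the master has to send the variable before the chunk runs. */
bool master_sends(xfer_t t)
{
    assert(t >= XFER_NONE && t <= XFER_INOUT);
    return t == XFER_IN || t == XFER_INOUT;
}

/* True when the slave has to send the variable back after the chunk runs. */
bool slave_sends(xfer_t t)
{
    assert(t >= XFER_NONE && t <= XFER_INOUT);
    return t == XFER_OUT || t == XFER_INOUT;
}
```

Under this mapping, `sum` in a loop that only writes `sum[i]` would be classified as OUT, and in a loop that only reads `sum[j]` as IN.
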
Static Task Division
[Diagram: the master sends each slave (Slave 1, Slave 2, ... Slave N) an iteration start, a number of iterations and the IN/INOUT variables; each slave returns its OUT variables.]
- The outer loop is scheduled in round-robin fashion by using MPI_Recv from specific ranks
- Could lead to an unbalanced load

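The static division can be sketched without MPI as a pure index computation. This is a minimal sketch, assuming the iteration space is cut into fixed-size chunks that are dealt to slaves 1..N in rank order. The chunk size, the `chunk_t` type and both function names are illustrative assumptions, not OMP2MPI's generated code.

```c
#include <assert.h>

/* One unit of work as in the diagram: iteration start + number of iterations. */
typedef struct { int start; int count; } chunk_t;

/* Chunk number k of a loop with n iterations cut into pieces of part_size.
 * The last chunk may be shorter; chunks past the end have count 0. */
chunk_t nth_chunk(int n, int part_size, int k)
{
    assert(n >= 0 && part_size > 0 && k >= 0);
    chunk_t c = { k * part_size, 0 };
    if (c.start < n) {
        int left = n - c.start;
        c.count = left < part_size ? left : part_size;
    }
    return c;
}

/* Round-robin owner of chunk k among slaves ranked 1..n_slaves
 * (rank 0 is reserved for the master). */
int round_robin_owner(int k, int n_slaves)
{
    assert(k >= 0 && n_slaves > 0);
    return 1 + k % n_slaves;
}
```

Because the owner of every chunk is fixed in advance, a slave that receives expensive chunks cannot hand work over to an idle one, which is where the load imbalance comes from.
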
Static Task division
[Diagram: the master sends each slave (Slave 1, Slave 2, ... Slave N) an iteration start, a number of iterations and the IN/INOUT variables; slaves return their data and are then sent a new iteration start, number of iterations and IN/INOUT variables.]
- The outer loop is scheduled by using MPI_Recv with MPI_ANY_SOURCE: the next chunk goes to whichever slave reports back first
- More efficient

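The difference between the two schedules can be illustrated with a small simulation that needs no MPI. The sketch below assumes each chunk has a known cost and ignores communication time. It compares the finishing time when chunks are dealt round-robin with the finishing time when each chunk goes to the slave that becomes free first, which is the effect of receiving from MPI_ANY_SOURCE. The cost model and all names are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_SLAVES 64

/* Finishing time when chunk k is always given to slave k % n_slaves. */
double makespan_round_robin(const double *cost, int n_chunks, int n_slaves)
{
    assert(n_slaves > 0 && n_slaves <= MAX_SLAVES);
    double busy[MAX_SLAVES] = { 0.0 };
    double worst = 0.0;
    for (int k = 0; k < n_chunks; ++k) {
        int s = k % n_slaves;
        busy[s] += cost[k];
        if (busy[s] > worst) worst = busy[s];
    }
    return worst;
}

/* Finishing time when each chunk goes to the slave that is free earliest. */
double makespan_first_free(const double *cost, int n_chunks, int n_slaves)
{
    assert(n_slaves > 0 && n_slaves <= MAX_SLAVES);
    double busy[MAX_SLAVES] = { 0.0 };
    double worst = 0.0;
    for (int k = 0; k < n_chunks; ++k) {
        int s = 0;
        for (int t = 1; t < n_slaves; ++t) {
            if (busy[t] < busy[s]) s = t;   /* pick the least loaded slave */
        }
        busy[s] += cost[k];
        if (busy[s] > worst) worst = busy[s];
    }
    return worst;
}
```

With chunk costs {8, 1, 1, 1, 1, 1, 1, 1} and two slaves, the round-robin schedule finishes at 11 time units and the first-free schedule at 8.
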
Source Code Example

    void main() {
        ...
        #pragma omp parallel for schedule(dynamic) target mpi
        for (int i = 0; i < N; ++i) {
            double x = (i + 0.5) * step;
            sum[i] = 4.0 / (1.0 + x * x);
        }
        #pragma omp parallel for reduction(+:total) target mpi
        for (int j = 0; j < N; ++j) {
            total += sum[j];
        }
        ...
    }

Source loop:

    #pragma omp parallel for schedule(dynamic) target mpi
    for (int i = 0; i < N; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }

Generated code:

    ...
    const int FTAG = 0;   /* tag: no more work, slave must finish */
    const int ATAG = 1;   /* tag: a new chunk is assigned */
    int partSize = ((N - 0)) / (size - 1), offset;
    if (myid == 0) {
        /* Master: hand one initial chunk to every slave. */
        int followIN = 0, killed = 0;
        for (int to = 1; to < size; ++to) {
            MPI_Send(&followIN, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
            followIN += partSize;
        }
        /* Master: collect results and reassign work until all slaves are stopped. */
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
            int source = stat.MPI_SOURCE;
            MPI_Recv(&partSize, 1, MPI_INT, source, ...);
            MPI_Recv(&sum[offset], partSize, MPI_DOUBLE, source, ...);
            if (followIN >= N) {   /* all iterations assigned: stop this slave */
                MPI_Send(&offset, 1, MPI_INT, source, FTAG, ...);
                killed++;
            } else {
                partSize = min(partSize, N - followIN);
                MPI_Send(&followIN, 1, MPI_INT, source, ATAG, ...);
                MPI_Send(&partSize, 1, MPI_INT, source, ATAG, ...);
            }
            followIN += partSize;
            if (killed == size - 1) break;
        }
    } else {
        /* Slave: process chunks until the master sends FTAG. */
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
            if (stat.MPI_TAG == ATAG) {
                MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, ...);
                for (int i = offset; i < offset + partSize; ++i) {
                    double x = (i + 0.5) * step;
                    sum[i] = 4.0 / (1.0 + x * x);
                }
                MPI_Send(&offset, 1, MPI_INT, 0, 0, ...);
                MPI_Send(&partSize, 1, MPI_INT, 0, 0, ...);
                MPI_Send(&sum[offset], partSize, MPI_DOUBLE, 0, 0, ...);
            } else if (stat.MPI_TAG == FTAG) {
                break;
            }
        }
    }

Source Code Example

    void main() {
        ...
        #pragma omp parallel for target mpi
        for (int i = 0; i < N; ++i) {
            double x = (i + 0.5) * step;
            sum[i] = 4.0 / (1.0 + x * x);
        }
        #pragma omp parallel for reduction(+:total) schedule(dynamic) target mpi
        for (int j = 0; j < N; ++j) {
            total += sum[j];
        }
        ...
    }

Source loop:

    #pragma omp parallel for reduction(+:total) schedule(dynamic) target mpi
    for (int j = 0; j < N; ++j) {
        total += sum[j];
    }

Generated code:

    double work0;   /* partial result received from a slave */
    int j = 0;
    partSize = ((N - 0)) / (size - 1);
    if (myid == 0) {
        /* Master: hand one initial chunk to every slave. */
        int followIN = 0;
        int killed = 0;
        for (int to = 1; to < size; ++to) {
            MPI_Send(&followIN, 1, MPI_INT, to, ATAG, ...);
            MPI_Send(&partSize, 1, MPI_INT, to, ATAG, ...);
            followIN += partSize;
        }
        /* Master: accumulate partial totals and reassign work. */
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
            int source = stat.MPI_SOURCE;
            MPI_Recv(&partSize, 1, MPI_INT, source, MPI_ANY_TAG, ...);
            MPI_Recv(&work0, 1, MPI_DOUBLE, source, MPI_ANY_TAG, ...);
            total += work0;
            if (followIN >= N) {   /* all iterations assigned: stop this slave */
                MPI_Send(&offset, 1, MPI_INT, source, FTAG, ...);
                killed++;
            } else {
                partSize = min(partSize, N - followIN);
                MPI_Send(&followIN, 1, MPI_INT, source, ATAG, ...);
                MPI_Send(&partSize, 1, MPI_INT, source, ATAG, ...);
            }
            followIN += partSize;
            if (killed == size - 1) break;
        }
    }
    if (myid != 0) {
        /* Slave: reduce each assigned chunk into a private total. */
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
            if (stat.MPI_TAG == ATAG) {
                MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, ...);
                total = 0;
                for (int j = offset; j < offset + partSize; ++j) {
                    total += sum[j];
                }
                MPI_Send(&offset, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
                MPI_Send(&partSize, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
                MPI_Send(&total, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            } else if (stat.MPI_TAG == FTAG) break;
        }
    }

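The reduction pattern in the generated code, where each slave accumulates a private partial total and the master adds the partial results, can be checked without MPI. The following sketch assumes a fixed chunk size and replaces the message exchange with a plain function call. `partial_sum` and `chunked_sum` are hypothetical names.

```c
#include <assert.h>

/* Work done by a slave: sum of data[offset .. offset + part_size). */
double partial_sum(const double *data, int offset, int part_size)
{
    double total = 0.0;   /* reset for every chunk, as in the slave loop */
    for (int j = offset; j < offset + part_size; ++j) {
        total += data[j];
    }
    return total;
}

/* Work done by the master: hand out chunks until all n iterations are
 * covered and accumulate the partial results. */
double chunked_sum(const double *data, int n, int part_size)
{
    assert(n >= 0 && part_size > 0);
    double total = 0.0;
    int follow_in = 0;    /* next iteration that has not been assigned */
    while (follow_in < n) {
        int left = n - follow_in;
        int size = left < part_size ? left : part_size;
        /* The call below stands in for the Send/Recv round trip. */
        total += partial_sum(data, follow_in, size);
        follow_in += size;
    }
    return total;
}
```

Because floating-point addition is not associative, a chunked reduction can differ from the sequential sum in the last bits for general data; the order in which partial results arrive at the master has the same effect.
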
Experimental Results
Experiment characteristics:
- Sequential, OpenMP and MPI (bullxmpi) versions
- 64 CPUs E at 2.40 GHz (Bullion quadri module)
- Scalability chart with 16, 32 and 64 cores
- Tests made using a subset of the Polybench benchmark

Experimental Results
[Charts, one per benchmark: GEMM, 2MM, TRMM, SYR2K, SYRK, MVT]

Experimental Results
[Charts, one per benchmark: SEIDEL, LUDCMP, JACOBI 2D, COVARIANCE, CORRELATION, CONVOLUTION]

Conclusions
- The programmer avoids spending time learning MPI functions.
- The tested set of problems from Polybench [8] obtains, in most cases, a speedup of more than 20x on 64 cores compared to the sequential version.
- An average speedup of over 4x compared to OpenMP.
- OMP2MPI gives a solution that allows further optimization by an expert who wants to achieve better results.
- OMP2MPI automatically generates MPI source code, allowing the program to exploit non-shared-memory architectures such as clusters or Network-on-Chip-based (NoC-based) Multiprocessor Systems-on-Chip (MPSoC).
Thanks for your attention!

Current Limitations
- Complex for loops are not supported by OMP2MPI
- Example: the step is not constant across iterations

    #pragma omp parallel for
    for (int i = 0; i < 100; i += cos(i)) { … }

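A constant step matters because the iteration space has to be known before the loop runs in order to cut it into chunks. A minimal sketch of that computation follows, assuming a loop of the form `for (i = lo; i < hi; i += step)` with a positive constant step. The function names are hypothetical and the sketch is not taken from OMP2MPI.

```c
#include <assert.h>

/* Number of iterations of: for (i = lo; i < hi; i += step), step > 0. */
int trip_count(int lo, int hi, int step)
{
    assert(step > 0);
    if (hi <= lo) return 0;
    return (hi - lo + step - 1) / step;   /* ceiling of (hi - lo) / step */
}

/* Value of the loop variable in iteration number k (0-based).
 * This closed form exists only because the step is constant. */
int iteration_value(int lo, int step, int k)
{
    assert(step > 0 && k >= 0);
    return lo + k * step;
}
```

With a step such as `cos(i)`, neither the trip count nor the value of `i` at the start of a chunk can be computed without running all earlier iterations, so a chunk cannot be described by an iteration start and a number of iterations alone.
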
Shared Memory Handling

    var[i][j] = 2*i;
    var[i] = j*2;

Example of Current Limitations
Concurrent access to a shared variable:

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            var[j] = var[i] * 2;
        }
    }

Iterator in the second index of an array access:

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            var[j][i] = 2 * j;
        }
    }
