Presentation transcript: "OMP2MPI: Automatic MPI code generation from OpenMP programs"

Slide 1: OMP2MPI: Automatic MPI code generation from OpenMP programs
A. Saà-Garriga, D. Castells-Rufas and J. Carrabina
{Albert.saa, David.castells, Jordi.Carrabina}@uab.cat
Microelectronic and Electronic Systems Department, Universitat Autònoma de Barcelona (UAB)
HIP3ES, 21/01/2015

Slide 2: Outline
1. Introduction
2. OMP2MPI Compiler
3. Results
4. Conclusions

Slide 3: Section 1: Introduction

Slide 4: OpenMP and MPI
De-facto standards commonly used for programming High Performance Computing (HPC) applications.
MPI:
- Usually associated with large distributed-memory systems
- ...but implementations take advantage of shared memory inside nodes
- ...and it can also be used for distributed-memory many-cores
- Very intrusive: calls are immersed in the sequential code
OpenMP:
- Simple, easy to learn
- The programmer is exposed to a shared-memory model
- ...usually so, but there are several options to extend it to different architectures
- Harder to scale up efficiently

Slide 5: Goal
- Generate MPI from OpenMP
- Go beyond the limits of shared memory (with MPI) while starting from easy OpenMP source code
- Target large supercomputers as well as distributed-memory embedded systems (STORM)
Target examples:
- Tianhe-2 supercomputer (DM), current #1 in the Top500: 16,000 nodes, 3,120,000 cores
- Bull Bullion node (SM): 160 cores
- UAB's FPGA-based MPSoC (DM) with ocMPI: 16 cores

Slide 6: Section 2: OMP2MPI Compiler

Slide 7: OMP2MPI
- Source-to-source compiler
- Based on the Mercurium (BSC) compilation framework

Slide 8: Input Source Code
- We focus on #pragma omp parallel for and on reduction operations
- We transform them into an MPI application using a reduced MPI subset (MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize)
- We support loops with bounded limits and constantly spaced iterations
- Private variables are correctly handled by design
- Shared variables are maintained by the master node

void main() {
    ...
    #pragma omp parallel for target mpi
    for (int i = 0; i < N; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }

    #pragma omp parallel for reduction(+:total) target mpi
    for (int j = 0; j < N; ++j) {
        total += sum[j];
    }
    ...
}
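For reference, the snippet below is a minimal, self-contained variant of this input program (the classic pi approximation) that any OpenMP-capable C compiler accepts. The value of N, the initialization of step and the final print are assumptions added here for completeness, and the OMP2MPI-specific target mpi clause is left out so the pragmas remain standard OpenMP.

#include <stdio.h>
#include <stdlib.h>

#define N 1000000                       /* assumed problem size, not given on the slide */

int main(void) {
    double *sum = malloc(N * sizeof(double));
    double step = 1.0 / (double)N;      /* assumed: classic pi-integration step */
    double total = 0.0;

    /* First block: independent iterations, each one writes its own sum[i]. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }

    /* Second block: reduction over the shared scalar total. */
    #pragma omp parallel for reduction(+:total)
    for (int j = 0; j < N; ++j) {
        total += sum[j];
    }

    printf("pi ~= %f\n", total * step);
    free(sum);
    return 0;
}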

Slide 9: Generate MPI Source Code
- The main idea is to divide the OpenMP block into a master/slaves task (see the sketch below)
- MPI applications must be initialized and finalized
- Rank 0 contains all the sequential code from the original OpenMP application
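A minimal sketch of the overall structure this slide describes, assuming typical MPI boilerplate; it is not the literal output of the tool, only its shape: the program is wrapped in MPI_Init/MPI_Finalize, rank 0 keeps the sequential code, and the remaining ranks act as slaves.

#include <mpi.h>

int main(int argc, char **argv) {
    int myid, size;

    MPI_Init(&argc, &argv);                      /* every generated program is initialized... */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (myid == 0) {
        /* Master: runs all the sequential code of the original OpenMP
         * program and distributes work when a parallel region is reached. */
    } else {
        /* Slaves: wait for work messages, execute the offloaded loop
         * chunks, send results back, and stop when a finish tag arrives. */
    }

    MPI_Finalize();                              /* ...and finalized */
    return 0;
}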

Slide 10: Shared Variable Analysis
- For each shared variable used inside an OpenMP block to be transformed, OMP2MPI analyzes the Abstract Syntax Tree to identify when and whether it is accessed
- Depending on that information, MPI_Send / MPI_Recv instructions are inserted to transfer the data to the appropriate slaves (illustrated below)
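To make the idea concrete, here is a hypothetical, self-contained example of the kind of transfers that get inserted, assuming one shared array a that is only read (IN) inside the block and one shared array b that is only written (OUT). The names, sizes and tag value are illustrative, not taken from the tool.

#include <mpi.h>
#include <stdio.h>

#define N 8
#define DATA_TAG 2                      /* illustrative tag, not the tool's */

int main(int argc, char **argv) {
    int myid, size;
    double a[N], b[N];
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) { MPI_Finalize(); return 1; }  /* needs at least one slave */

    if (myid == 0) {
        for (int i = 0; i < N; ++i) a[i] = i;                             /* IN data lives on the master */
        MPI_Send(a, N, MPI_DOUBLE, 1, DATA_TAG, MPI_COMM_WORLD);          /* IN: master -> slave */
        MPI_Recv(b, N, MPI_DOUBLE, 1, DATA_TAG, MPI_COMM_WORLD, &stat);   /* OUT: slave -> master */
        printf("b[0] = %f\n", b[0]);
    } else if (myid == 1) {
        MPI_Recv(a, N, MPI_DOUBLE, 0, DATA_TAG, MPI_COMM_WORLD, &stat);
        for (int i = 0; i < N; ++i) b[i] = 2.0 * a[i];                    /* the offloaded loop body */
        MPI_Send(b, N, MPI_DOUBLE, 0, DATA_TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}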

Slide 11: Static Task Division
[Diagram: the master sends each slave an iteration start, the number of iterations and the IN/INOUT variables, and collects the OUT variables from each slave.]
- The outer loop is scheduled in round-robin fashion by using MPI_Recv from specific ranks (sketched below)
- This can lead to an unbalanced load
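A compilable sketch of this static scheme, under the stated assumptions that each slave gets one fixed chunk and the master collects results from specific ranks in order. N, ATAG and the divisibility of N by the number of slaves are simplifications made here, not guarantees of the generated code.

#include <mpi.h>
#include <stdio.h>

#define N 1024
#define ATAG 1                                   /* work tag, mirroring the naming used later */

int main(int argc, char **argv) {
    int myid, size;
    double sum[N];
    double step = 1.0 / (double)N;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) { MPI_Finalize(); return 1; }

    int chunk = N / (size - 1);                  /* assumes N divisible by size-1 for brevity */

    if (myid == 0) {
        for (int to = 1; to < size; ++to) {      /* iteration start + count to every slave */
            int start = (to - 1) * chunk;
            MPI_Send(&start, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
            MPI_Send(&chunk, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
        }
        for (int from = 1; from < size; ++from) {   /* round robin: receive from specific ranks, in order */
            int start = (from - 1) * chunk;
            MPI_Recv(&sum[start], chunk, MPI_DOUBLE, from, ATAG, MPI_COMM_WORLD, &stat);
        }
        printf("sum[0] = %f\n", sum[0]);
    } else {
        int start, count;
        MPI_Recv(&start, 1, MPI_INT, 0, ATAG, MPI_COMM_WORLD, &stat);
        MPI_Recv(&count, 1, MPI_INT, 0, ATAG, MPI_COMM_WORLD, &stat);
        for (int i = start; i < start + count; ++i) {
            double x = (i + 0.5) * step;
            sum[i] = 4.0 / (1.0 + x * x);
        }
        MPI_Send(&sum[start], count, MPI_DOUBLE, 0, ATAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Because the master waits on rank 1 first, a slow rank 1 delays the collection of results that faster slaves may already have ready, which is the load imbalance the slide refers to.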

Slide 12: Dynamic Task Division
[Diagram: the master sends each slave an iteration start, the number of iterations and the IN/INOUT variables; as each slave returns its data, the master immediately sends it the next chunk.]
- The outer loop is scheduled by using MPI_Recv with MPI_ANY_SOURCE
- More efficient

Slide 13: Section 3: Results

Slide 14: Source Code Example

void main() {
    ...
    #pragma omp parallel for schedule(dynamic) target mpi
    for (int i = 0; i < N; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }

    #pragma omp parallel for reduction(+:total) target mpi
    for (int j = 0; j < N; ++j) {
        total += sum[j];
    }
    ...
}

Slide 15: Generated MPI code for the first loop

...
const int FTAG = 0;                              /* finish tag */
const int ATAG = 1;                              /* work-assignment tag */
int partSize = ((N - 0)) / (size - 1), offset;
if (myid == 0) {                                 /* master */
    int followIN = 0, killed = 0;
    for (int to = 1; to < size; ++to) {          /* initial chunk for every slave */
        MPI_Send(&followIN, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
        MPI_Send(&partSize, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
        followIN += partSize;
    }
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
        int source = stat.MPI_SOURCE;
        MPI_Recv(&partSize, 1, MPI_INT, source, ...);
        MPI_Recv(&sum[offset], partSize, MPI_DOUBLE, source, ...);
        if (followIN > N) {                      /* no work left: tell this slave to finish */
            MPI_Send(&offset, 1, MPI_INT, source, FTAG, ...);
            killed++;
        } else {                                 /* hand the next chunk to whichever slave finished */
            partSize = min(partSize, N - followIN);
            MPI_Send(&followIN, 1, MPI_INT, source, ATAG, ...);
            MPI_Send(&partSize, 1, MPI_INT, source, ATAG, ...);
        }
        followIN += partSize;
        if (killed == size - 1) break;
    }
} else {                                         /* slaves */
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
        if (stat.MPI_TAG == ATAG) {
            MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, ...);
            for (int i = offset; i < offset + partSize; ++i) {   /* original loop body */
                double x = (i + 0.5) * step;
                sum[i] = 4.0 / (1.0 + x * x);
            }
            MPI_Send(&offset, 1, MPI_INT, 0, 0, ...);
            MPI_Send(&partSize, 1, MPI_INT, 0, 0, ...);
            MPI_Send(&sum[offset], partSize, MPI_DOUBLE, 0, 0, ...);
        } else if (stat.MPI_TAG == FTAG) {
            break;
        }
    }
}

Original OpenMP block:
#pragma omp parallel for schedule(dynamic) target mpi
for (int i = 0; i < N; ++i) {
    double x = (i + 0.5) * step;
    sum[i] = 4.0 / (1.0 + x * x);
}
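The slide elides the trailing arguments of the MPI calls (tag, communicator and, for receives, the status object); they stay elided above, but the short, self-contained example below shows the complete argument lists and the stat.MPI_SOURCE / stat.MPI_TAG fields the generated code relies on. The tag values are illustrative, not the tool's.

#include <mpi.h>
#include <stdio.h>

#define ATAG 1                                   /* illustrative work tag */
#define FTAG 0                                   /* illustrative finish tag */

int main(int argc, char **argv) {
    int myid, size, value = 0;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (myid == 0) {
            /* Full receive signature: buffer, count, datatype, source, tag,
             * communicator, status.  MPI_ANY_SOURCE / MPI_ANY_TAG accept any
             * sender; the status then reveals who it was and with which tag. */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &stat);
            if (stat.MPI_TAG == ATAG)
                printf("work result %d from rank %d\n", value, stat.MPI_SOURCE);
            else if (stat.MPI_TAG == FTAG)
                printf("rank %d finished\n", stat.MPI_SOURCE);
        } else if (myid == 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, ATAG, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}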

Slide 16: Source Code Example

void main() {
    ...
    #pragma omp parallel for target mpi
    for (int i = 0; i < N; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }

    #pragma omp parallel for reduction(+:total) schedule(dynamic) target mpi
    for (int j = 0; j < N; ++j) {
        total += sum[j];
    }
    ...
}

Slide 17: Generated MPI code for the reduction loop

double work0;                                    /* holds one slave's partial sum */
int j = 0;
partSize = ((N - 0)) / (size - 1);
if (myid == 0) {                                 /* master */
    int followIN = 0;
    int killed = 0;
    for (int to = 1; to < size; ++to) {
        MPI_Send(&followIN, 1, MPI_INT, to, ATAG, ...);
        MPI_Send(&partSize, 1, MPI_INT, to, ATAG, ...);
        followIN += partSize;
    }
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
        int source = stat.MPI_SOURCE;
        MPI_Recv(&partSize, 1, MPI_INT, source, MPI_ANY_TAG, ...);
        MPI_Recv(&work0, 1, MPI_DOUBLE, source, MPI_ANY_TAG, ...);
        total += work0;                          /* master accumulates the partial sums */
        if (followIN > N) {
            MPI_Send(&offset, 1, MPI_INT, source, FTAG, ...);
            killed++;
        } else {
            partSize = min(partSize, N - followIN);
            MPI_Send(&followIN, 1, MPI_INT, source, ATAG, ...);
            MPI_Send(&partSize, 1, MPI_INT, source, ATAG, ...);
        }
        followIN += partSize;
        if (killed == size - 1) break;
    }
}
if (myid != 0) {                                 /* slaves */
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
        if (stat.MPI_TAG == ATAG) {
            MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, ...);
            total = 0;                           /* local partial sum for this chunk */
            for (int j = offset; j < offset + partSize; ++j) {
                total += sum[j];
            }
            MPI_Send(&offset, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&total, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else if (stat.MPI_TAG == FTAG) {
            break;
        }
    }
}

Original OpenMP block:
#pragma omp parallel for reduction(+:total) schedule(dynamic) target mpi
for (int j = 0; j < N; ++j) {
    total += sum[j];
}

Slide 18: Experimental Results
Experiment characteristics:
- Sequential, OpenMP and MPI (bullxmpi) versions
- 64 E7-4800 CPUs at 2.40 GHz (Bullion quadri-module)
- Scalability charts for 16, 32 and 64 cores
- Tests run on a subset of the Polybench benchmark

Slide 19: Experimental Results
[Speedup charts for GEMM, 2MM, TRMM, SYR2K, SYRK and MVT.]

Slide 20: Experimental Results
[Speedup charts for SEIDEL, LUDCMP, JACOBI 2D, COVARIANCE, CORRELATION and CONVOLUTION.]

Slide 21: Section 4: Conclusions

Slide 22: Conclusions
- The programmer avoids spending time learning MPI functions.
- On the tested set of problems from Polybench [8], most cases obtain more than 20x speedup on 64 cores compared to the sequential version.
- The average speedup is over 4x compared to OpenMP.
- OMP2MPI gives a solution that allows further optimization by an expert who wants to achieve better results.
- OMP2MPI automatically generates MPI source code, allowing the program to exploit non-shared-memory architectures such as clusters or Network-on-Chip based (NoC-based) Multiprocessor Systems-on-Chip (MPSoC).
...thanks for your attention!

Slide 23: Thanks for your attention

Slide 24: Current Limitations
Complex for loops are not supported by OMP2MPI when the step is not constant across iterations:

#pragma omp parallel for
for (int i = 0; i < 100; i += cos(i)) {
    ...
}
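For contrast, a small self-contained loop that does meet the requirements stated on slide 8 (bounded limits, constant step, independent iterations); it is only an illustration, not an example taken from the tool's test set.

#include <stdio.h>

int main(void) {
    int acc[100] = {0};

    /* Supported shape: bounded limits and a constant step between iterations. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i += 2) {
        acc[i] = i * i;              /* independent iterations */
    }

    printf("%d\n", acc[98]);
    return 0;
}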

Slide 25: Shared Memory Handling

var[i][j] = 2*i;
var[i] = j*2;

Slide 26: Shared Memory Handling

Slide 27: Example of Current Limitations
Concurrent access to a shared variable:

#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
        var[j] = var[i] * 2;
    }
}

Iterator in the second index of an array access:

#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
        var[j][i] = 2 * j;
    }
}

