
Slide 1/28: OMP2MPI: Automatic MPI code generation from OpenMP programs

A. Saà-Garriga, D. Castells-Rufas and J. Carrabina {Albert.saa, David.castells,
Microelectronic and Electronic Systems Department (CAIAC), Universitat Autònoma de Barcelona (UAB). HIP3ES, 21/01/2015.

Slide 2/28: Outline

1. Introduction
2. OMP2MPI Compiler
3. Results
4. Conclusions

Slide 3/28: Outline (Section 1: Introduction)

Slide 4/28: OpenMP and MPI

De-facto standards commonly used for programming High Performance Computing (HPC) applications.

MPI
- Usually associated with large distributed-memory systems
- ...but implementations take advantage of shared memory inside nodes
- ...and it can also be used for distributed-memory many-cores
- Very intrusive: immersed in the sequential code

OpenMP
- Simple, easy to learn
- The programmer is exposed to a shared-memory model
- ...usually so, but there are several options to extend it to different architectures
- Harder to scale up efficiently
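To make the "intrusiveness" contrast concrete, a minimal sketch (not taken from the slides; the array a, the size N and the helper functions are placeholders) of the same loop written with OpenMP and by hand with MPI:

    /* Illustrative sketch only: one directive vs. explicit rank arithmetic. */
    #include <mpi.h>
    #include <omp.h>

    #define N 1000
    static double a[N];

    void scale_openmp(void) {
        /* OpenMP: shared memory assumed, a single pragma parallelizes the loop */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = 2.0 * a[i];
    }

    void scale_mpi(void) {
        /* MPI: the programmer splits the iteration space and owns the data movement
           (assumes MPI_Init was already called) */
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int chunk = N / size;
        int begin = rank * chunk;
        int end   = (rank == size - 1) ? N : begin + chunk;
        for (int i = begin; i < end; ++i)
            a[i] = 2.0 * a[i];
        /* ...plus the explicit sends/receives to gather results, omitted here */
    }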

Slide 5/28: Goal

- Generate MPI from OpenMP
- Go beyond the limits of shared memory (with MPI) while starting from easy OpenMP source code
- Target both large supercomputers and distributed-memory embedded systems (STORM)

Targets shown: Tianhe-2 supercomputer (DM), current #1 in the Top500; Bull Bullion node (SM), 160 cores; UAB's FPGA-based MPSoC (DM) running ocMPI, 16 cores.

Slide 6/28: Outline (Section 2: OMP2MPI Compiler)

Slide 7/28: OMP2MPI

- Source-to-source compiler
- Based on the Mercurium (BSC) compilation framework

Slide 8/28: Input Source Code

- We focus on:
  - #pragma omp parallel for
  - reduction operations
- We transform them into an MPI application using a reduced MPI subset (MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize)
- We support loops with bounded limits and constantly spaced iterations
- Private variables are correctly handled by design
- Shared variables are maintained by the master node

    void main() {
        ...
        #pragma omp parallel for target mpi
        for (int i = 0; i < N; ++i) {
            double x = (i + 0.5) * step;
            sum[i] = 4.0 / (1.0 + x * x);
        }

        #pragma omp parallel for reduction(+:total) target mpi
        for (int j = 0; j < N; ++j) {
            total += sum[j];
        }
        ...
    }

Slide 9/28: Generate MPI Source Code

- The main idea is to divide the OpenMP block into a master/slaves task
- MPI applications must be initialized and finalized
- Rank 0 contains all the sequential code from the original OpenMP application
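A minimal sketch of the skeleton this implies (not the tool's literal output; it uses only the standard MPI C API):

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int myid, size;
        MPI_Init(&argc, &argv);                 /* every generated program starts here */
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (myid == 0) {
            /* master: runs the original sequential code and, at each
               transformed OpenMP block, distributes work to the slaves */
        } else {
            /* slaves: wait for chunks of the transformed parallel loops */
        }

        MPI_Finalize();                         /* every generated program ends here */
        return 0;
    }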

Slide 10/28: Shared Variable Analysis

- For each shared variable used inside an OpenMP block to be transformed, OMP2MPI analyzes the Abstract Syntax Tree to identify when/whether it is accessed
- Depending on that information, MPI_Send / MPI_Recv instructions are inserted to transfer the data to the appropriate slaves
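For intuition, a placement sketch of where such transfers would go for a chunk [offset, offset+partSize); the exact rules are the tool's, and the names (in, out, offset, partSize, slave, ATAG, stat) are placeholders. A shared array that is only read inside the loop (IN) is sent to the slave before the chunk runs; one that is only written (OUT) is sent back and collected by the master.

    /* master, before a slave executes its chunk: send the IN data */
    MPI_Send(&in[offset],  partSize, MPI_DOUBLE, slave, ATAG, MPI_COMM_WORLD);

    /* slave, after executing its chunk: return the OUT data */
    MPI_Send(&out[offset], partSize, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);

    /* master, when collecting that slave's results */
    MPI_Recv(&out[offset], partSize, MPI_DOUBLE, slave, MPI_ANY_TAG,
             MPI_COMM_WORLD, &stat);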

Slide 11/28: Static Task Division

[Diagram: the master sends each slave an iteration start, a number of iterations, and the IN/INOUT variables, then collects the OUT variables from each slave in turn.]

- The outer loop is scheduled in round-robin fashion by using MPI_Recv from specific ranks
- This could lead to an unbalanced load
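A sketch of the static scheme on the master side, assuming the same placeholder names used in the generated code shown later (offset, partSize, sum, stat; numRounds is also a placeholder):

    /* Static scheme (sketch): the master polls the slaves in a fixed order,
       so one slow slave delays the collection of everyone else's results. */
    for (int round = 0; round < numRounds; ++round) {
        for (int source = 1; source < size; ++source) {   /* fixed rank order */
            MPI_Recv(&offset,   1, MPI_INT, source, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
            MPI_Recv(&partSize, 1, MPI_INT, source, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
            MPI_Recv(&sum[offset], partSize, MPI_DOUBLE, source, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &stat);
            /* hand the next chunk to this same slave ... */
        }
    }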

Slide 12/28: Dynamic Task Division

[Diagram: the master hands out the iteration start, the number of iterations and the IN/INOUT variables, and collects data from whichever slave finishes first.]

- The outer loop is scheduled by using MPI_Recv with MPI_ANY_SOURCE
- This is more efficient than the static scheme
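A compact sketch of the dynamic master loop, again with placeholder names (remainingChunks, followIN, ATAG); the full generated code, including termination via a finish tag, is shown on the next slides:

    /* Dynamic scheme (sketch): whichever slave reports first gets the next chunk. */
    while (remainingChunks > 0) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &stat);
        int source = stat.MPI_SOURCE;            /* the slave that just finished */
        MPI_Recv(&partSize, 1, MPI_INT, source, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        MPI_Recv(&sum[offset], partSize, MPI_DOUBLE, source, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &stat);
        MPI_Send(&followIN, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);  /* next chunk */
        MPI_Send(&partSize, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
        followIN += partSize;
        remainingChunks--;
    }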

Slide 13/28: Outline (Section 3: Results)

Slide 14/28: Source Code Example

    void main() {
        ...
        #pragma omp parallel for schedule(dynamic) target mpi
        for (int i = 0; i < N; ++i) {
            double x = (i + 0.5) * step;
            sum[i] = 4.0 / (1.0 + x * x);
        }

        #pragma omp parallel for reduction(+:total) target mpi
        for (int j = 0; j < N; ++j) {
            total += sum[j];
        }
        ...
    }

Slide 15/28: Generated MPI Code (first loop)

    ...
    const int FTAG = 0;
    const int ATAG = 1;
    int partSize = ((N - 0)) / (size - 1), offset;
    if (myid == 0) {
        int followIN = 0, killed = 0;
        for (int to = 1; to < size; ++to) {
            MPI_Send(&followIN, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
            followIN += partSize;
        }
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
            int source = stat.MPI_SOURCE;
            MPI_Recv(&partSize, 1, MPI_INT, source, ...);
            MPI_Recv(&sum[offset], partSize, MPI_DOUBLE, source, ...);
            if (followIN > N) {
                MPI_Send(&offset, 1, MPI_INT, source, FTAG, ...);
                killed++;
            } else {
                partSize = min(partSize, N - followIN);
                MPI_Send(&followIN, 1, MPI_INT, source, ATAG, ...);
                MPI_Send(&partSize, 1, MPI_INT, source, ATAG, ...);
            }
            followIN += partSize;
            if (killed == size - 1) break;
        }
    } else {
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
            if (stat.MPI_TAG == ATAG) {
                MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, ...);
                for (int i = offset; i < offset + partSize; ++i) {
                    double x = (i + 0.5) * step;
                    sum[i] = 4.0 / (1.0 + x * x);
                }
                MPI_Send(&offset, 1, MPI_INT, 0, 0, ...);
                MPI_Send(&partSize, 1, MPI_INT, 0, 0, ...);
                MPI_Send(&sum[offset], partSize, MPI_DOUBLE, 0, 0, ...);
            } else if (stat.MPI_TAG == FTAG) {
                break;
            }
        }
    }

Original OpenMP block shown for reference:

    #pragma omp parallel for schedule(dynamic) target mpi
    for (int i = 0; i < N; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }

Slide 16/28: Source Code Example

    void main() {
        ...
        #pragma omp parallel for target mpi
        for (int i = 0; i < N; ++i) {
            double x = (i + 0.5) * step;
            sum[i] = 4.0 / (1.0 + x * x);
        }

        #pragma omp parallel for reduction(+:total) schedule(dynamic) target mpi
        for (int j = 0; j < N; ++j) {
            total += sum[j];
        }
        ...
    }

Slide 17/28: Generated MPI Code (reduction loop)

    double work0;
    int j = 0;
    partSize = ((N - 0)) / (size - 1);
    if (myid == 0) {
        int followIN = 0;
        int killed = 0;
        for (int to = 1; to < size; ++to) {
            MPI_Send(&followIN, 1, MPI_INT, to, ATAG, ...);
            MPI_Send(&partSize, 1, MPI_INT, to, ATAG, ...);
            followIN += partSize;
        }
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
            int source = stat.MPI_SOURCE;
            MPI_Recv(&partSize, 1, MPI_INT, source, MPI_ANY_TAG, ...);
            MPI_Recv(&work0, 1, MPI_DOUBLE, source, MPI_ANY_TAG, ...);
            total += work0;
            if (followIN > N) {
                MPI_Send(&offset, 1, MPI_INT, source, FTAG, ...);
                killed++;
            } else {
                partSize = min(partSize, N - followIN);
                MPI_Send(&followIN, 1, MPI_INT, source, ATAG, ...);
                MPI_Send(&partSize, 1, MPI_INT, source, ATAG, ...);
            }
            followIN += partSize;
            if (killed == size - 1) break;
        }
    }
    if (myid != 0) {
        while (1) {
            MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, ...);
            if (stat.MPI_TAG == ATAG) {
                MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, ...);
                total = 0;
                for (int j = offset; j < offset + partSize; ++j) {
                    total += sum[j];
                }
                MPI_Send(&offset, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
                MPI_Send(&partSize, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
                MPI_Send(&total, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            } else if (stat.MPI_TAG == FTAG) {
                break;
            }
        }
    }

Original OpenMP block shown for reference:

    #pragma omp parallel for reduction(+:total) schedule(dynamic) target mpi
    for (int j = 0; j < N; ++j) {
        total += sum[j];
    }

Slide 18/28: Experimental Results

Experiment characteristics:
- Sequential, OpenMP and MPI (bullxmpi) versions
- 64 E CPUs at 2.40 GHz (Bullion quadri-module)
- Scalability charts with 16, 32 and 64 cores
- Tests made using a subset of the Polybench benchmark

Slide 19/28: Experimental Results

[Speedup charts for GEMM, 2MM, TRMM, SYR2K, SYRK and MVT]

Slide 20/28: Experimental Results

[Speedup charts for SEIDEL, LUDCMP, JACOBI-2D, COVARIANCE, CORRELATION and CONVOLUTION]

Slide 21/28: Outline (Section 4: Conclusions)

Slide 22/28: Conclusions

- The programmer avoids spending time learning MPI functions.
- The tested set of problems from Polybench [8] obtains in most cases more than a 20x speedup on 64 cores compared to the sequential version.
- An average speedup of over 4x compared to OpenMP.
- OMP2MPI gives a solution that allows further optimization by an expert who wants to achieve better results.
- OMP2MPI automatically generates MPI source code, allowing the program to exploit non-shared-memory architectures such as clusters or Network-on-Chip based (NoC-based) Multiprocessor Systems-on-Chip (MPSoC).

...thanks for your attention!

Slide 23/28: Thanks for your attention

Slide 24/28: Current Limitations

Complex for loops are not supported by OMP2MPI, e.g. when the step is not constant across iterations:

    #pragma omp parallel for
    for (int i = 0; i < 100; i += cos(i)) {
        ...
    }
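By contrast, a sketch of a loop shape the slides describe as supported (bounded limits, compile-time-constant stride); the stride of 2 and the array sum are just examples:

    /* Supported shape (sketch): constant spacing lets the iteration space
       be split into equal chunks for the slaves. */
    #pragma omp parallel for target mpi
    for (int i = 0; i < N; i += 2) {
        sum[i] = 4.0 * i;
    }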

Slide 25/28: Shared Memory Handling

Accesses where the parallelized iterator i appears as the (first) index can be handled, since each slave writes its own block of the array:

    var[i][j] = 2*i;
    var[i] = j*2;
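A hypothetical sketch of why this case is convenient (names partSize, offset, COLS are placeholders, and MPI_INT assumes var holds ints): a chunk of rows of a row-major 2D array is contiguous in memory, so a slave can return its whole block with a single message.

    /* Rows [offset, offset+partSize) of a row-major array are contiguous,
       so the entire chunk travels in one MPI_Send. */
    MPI_Send(&var[offset][0], partSize * COLS, MPI_INT, 0, 0, MPI_COMM_WORLD);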

Slide 26/28: Shared Memory Handling

[Figure-only slide; no text content in the transcript.]

Slide 27/28: Examples of Current Limitations

Concurrent access to a shared variable:

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            var[j] = var[i]*2;
        }
    }

Iterator in the second index of an array access:

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            var[j][i] = 2*j;
        }
    }

In both patterns, a chunk of outer-loop iterations does not map to a contiguous block of var owned exclusively by one slave, so the master cannot simply scatter and gather array blocks.