Using OpenMP offloading in Charm++

Using OpenMP offloading in Charm++
Matthias Diener, Charm++ Workshop 2018

OpenMP on accelerators

- Heterogeneous architectures (CPU + accelerator) are becoming common
- Main question: how do we use accelerators?
  - Traditionally: CUDA, OpenCL, …
- OpenMP is an interesting option:
  - Supports offloading to accelerators since version 4.0
  - No code duplication
  - Uses standard languages
  - Targets different types of accelerators

General overview – ZAXPY in OpenMP (CPU)

    double x[N], y[N], z[N], a;

    // calculate z[i] = a*x[i] + y[i]
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        z[i] = a*x[i] + y[i];

General overview – ZAXPY in OpenMP (GPU)

Compiler: generates code for the GPU. Runtime: runs the code on the device if possible, copying data from/to the GPU.

    double x[N], y[N], z[N], a;

    // calculate z = a*x + y
    #pragma omp target
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            z[i] = a*x[i] + y[i];
    }

The code is unmodified except for the pragmas. Data is implicitly copied, and all computation is done on the device.
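The implicit copies above transfer all three arrays in both directions. A minimal sketch (not from the slides) of the same kernel with explicit map clauses, using the OpenMP 4.5 combined construct that is the usual idiom for spreading a loop across a GPU:

    // ZAXPY with explicit data mapping: x and y are only copied to the
    // device, and z is only copied back to the host.
    void zaxpy(int n, double a, const double *x, const double *y, double *z)
    {
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n], y[0:n]) map(from: z[0:n])
        for (int i = 0; i < n; i++)
            z[i] = a * x[i] + y[i];
    }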

Compiler support

    Compiler   OpenMP offload version   Device types
    gcc        4.5                      Nvidia GPU, Xeon Phi
    clang                               Nvidia GPU, AMD GPU
    flang      n/a
    icc                                 Xeon Phi
    Cray cc    4.0                      Nvidia GPU
    IBM xl                              Nvidia GPU
    PGI

Limitations:
- Static linking only
- Requires a recent linker
- No C++ exceptions
- Not all operations are offloadable (e.g., I/O, network, …)
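For reference, typical command lines that enable offloading with these compilers (the exact spellings and supported targets vary by release and by which offload back ends were built in; check your installation):

    gcc   -fopenmp -foffload=nvptx-none zaxpy.c
    clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda zaxpy.c
    xlc   -qsmp=omp -qoffload zaxpy.c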

Performance results – K40, gcc 7.3

Performance results – V100, xl 13.1.7 beta2

Using OpenMP offloading in Charm++/AMPI

Using OpenMP offloading in Charm++

Current Charm++ includes an LLVM-based OpenMP runtime, but currently without offloading support.

Building Charm++: build as usual
- Use an offloading-enabled compiler
- Do not specify the "omp" option
- No need to add -fopenmp (or similar) options

Application:
- Can use OpenMP pragmas directly
- Needs to take care of data consistency for migration (see the sketch below)
- Compile with charmc/ampicc, passing the compiler's OpenMP/offloading option:

    charmc -fopenmp file.cpp
    charmc -qsmp -qoffload file.cpp
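A hedged illustration of the migration point (hypothetical names and structure, not from the talk): updates made in device memory must be copied back to the host before the runtime serializes an object for migration, since serialization only sees host memory. A minimal standalone sketch:

    #include <vector>

    // Sketch: a chare-like object whose working array lives on the device
    // between entry-method invocations. Worker, startWork() and
    // beforeMigration() are illustrative names only.
    struct Worker {
        int n;
        double *data;  // contiguous host storage that OpenMP maps to the device

        void startWork() {
            // Create a persistent device copy of the array (locals are used
            // because members cannot appear directly in pre-5.0 map clauses).
            double *d = data; int m = n;
            #pragma omp target enter data map(to: d[0:m])
        }

        void iterate() {
            double *d = data; int m = n;
            // All updates happen in device memory; the host copy goes stale.
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < m; i++)
                d[i] = 0.5 * (d[i] + 1.0);
        }

        void beforeMigration() {
            // Copy back and release the device copy, so that serialization
            // (e.g., a pup() method) sees the current values.
            double *d = data; int m = n;
            #pragma omp target exit data map(from: d[0:m])
        }
    };

    int main() {
        std::vector<double> buf(1000, 2.0);
        Worker w{static_cast<int>(buf.size()), buf.data()};
        w.startWork();
        w.iterate();
        w.beforeMigration();
        return 0;
    }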

Example – Jacobi3D

- Modified the Jacobi3D application to use OpenMP
- Run on the Ray machine (Power8 + P100), XL 13.1.7 beta2
- Two input sets: small (100*100*100) and large (1000*100*100)
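The modified Jacobi3D source is not reproduced in the transcript; the offloaded sweep could look roughly like the following sketch (assumed 7-point stencil shape, not the actual benchmark code):

    // Sketch of one offloaded Jacobi sweep over the interior points.
    // collapse(3) merges the loops to expose enough parallelism for the GPU.
    void jacobi_sweep(int nx, int ny, int nz, const double *in, double *out)
    {
        #pragma omp target teams distribute parallel for collapse(3) \
            map(to: in[0:nx*ny*nz]) map(tofrom: out[0:nx*ny*nz])
        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++)
                for (int i = 1; i < nx - 1; i++) {
                    int idx = (k * ny + j) * nx + i;
                    out[idx] = (in[idx - 1]       + in[idx + 1] +
                                in[idx - nx]      + in[idx + nx] +
                                in[idx - nx * ny] + in[idx + nx * ny]) / 6.0;
                }
    }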

Nvidia Visual Profiler

Conclusions and next steps

- OpenMP provides a simple way to use accelerators
  - Reasonable performance on GPUs compared to CUDA
  - Main challenge: comprehensive compiler support
- Can be used easily in Charm++/AMPI

Next steps:
- Extend the integrated LLVM-OpenMP to support offloading
- Interface with the GPU Manager

Questions?