Experiences with Achieving Portability across Heterogeneous Architectures. Lukasz G. Szafaryn, Todd Gamblin, Bronis R. de Supinski, and Kevin Skadron.

Presentation transcript:

Experiences with Achieving Portability across Heterogeneous Architectures. Lukasz G. Szafaryn+, Todd Gamblin++, Bronis R. de Supinski++, and Kevin Skadron+ (+University of Virginia, ++Lawrence Livermore National Laboratory)

Outline: Introduction, Portability Problems, Approach, Molecular Dynamics Code, Feature Tracking Code, Conclusions, Future Work

Introduction
- Cluster nodes are becoming heterogeneous (multi-core CPU plus GPU) in order to increase their compute capability
- Codes executing at the level of a node need to be ported to new architectures/accelerators for better performance
- Most existing codes are written for sequential execution, over-optimized for a particular architecture, and difficult to port
- We look at two representative applications; for each one, we want to arrive at and maintain a single source code base that can run on different architectures

Outline: Introduction, Portability Problems, Approach, Molecular Dynamics Code, Feature Tracking Code, Conclusions, Future Work

Portability Problems: Code Structure
- Sequential form
  - Does not express parallelism well; effort is needed to extract it
  - Many small, independent loops and tasks that could run concurrently
  - Optimizations such as function inlining and loop unrolling
- Monolithic and coupled structure
  - Hard to identify sections that can be parallelized (and/or offloaded)
  - Hard to determine the range of accessed data due to nested pointer structures
- Architecture-specific optimizations
  - Specifying native instructions to ensure desired execution
  - Memory alignment via padding
  - Orchestrating computation according to cache line size
- The prospect of significant code changes delays porting
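For illustration, the sketch below (hypothetical code, not taken from the two applications studied here) shows the kind of hand-tuned serial loop these problems produce: manual unrolling and an alignment pad tied to one cache geometry hide the fact that every iteration is independent.

    /* Illustrative only: a serial kernel over-optimized for one CPU.
       Assumes f holds n*PAD elements; PAD is a padding factor chosen for
       one particular cache geometry. The manual 4-way unrolling and the
       padded stride make the independent iterations hard to recognize
       and hard to retarget to another architecture. */
    #define PAD 2
    void scale_forces(float *f, float a, int n) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* unrolled by hand */
            f[(i + 0) * PAD] *= a;
            f[(i + 1) * PAD] *= a;
            f[(i + 2) * PAD] *= a;
            f[(i + 3) * PAD] *= a;
        }
        for (; i < n; i++)                 /* remainder loop */
            f[i * PAD] *= a;
    }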

Portability Problems: Parallel Framework
Current frameworks are inconvenient or inadequate:
- Cilk, TBB: APIs with runtime calls; work only with the CPU
- OpenMP: convenient annotation-based framework; works only with the CPU
- OpenCL: portable CPU-GPU API with runtime calls; low-level and difficult to learn
- CUDA, CUDA-lite: convenient GPU APIs with runtime calls; vendor-specific
- Cetus: OpenMP-to-CUDA translator; limited to OpenMP; relies heavily on code analysis
- PGI Accelerator API: convenient annotation-based framework; works only with the GPU; relies on code analysis
- Future OpenMP with accelerator support: convenient annotation-based framework; expected to fully support both CPU and GPU
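As a point of reference for the comparison above, the sketch below applies the annotation style and the runtime-call style to the same generic saxpy-style loop (an illustrative example, not code from this work): the OpenMP version keeps the serial loop intact, while the CUDA version requires an explicit kernel, allocations, and transfers.

    /* Annotation style (OpenMP): the serial loop is kept and annotated. */
    void saxpy_omp(int n, float a, const float *x, float *y) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Runtime-call style (CUDA): explicit kernel, allocation, and transfers. */
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void saxpy_cuda(int n, float a, const float *x, float *y) {
        float *dx, *dy;
        cudaMalloc((void **)&dx, n * sizeof(float));
        cudaMalloc((void **)&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
    }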

Portability Problems: Summary
Difficulties
- Achieving portable code written in a convenient framework is currently not possible
- Porting across architectures requires learning a new, usually architecture-specific, language; scientists are reluctant to do that
- The usually over-optimized, architecture-specific serial code must be converted to parallel code; only skilled programmers can do that
Outcomes
- Degraded performance after the initial port, with parallelism that is difficult to extract for further improvement
- Divergent code bases that are difficult to maintain
- Instead, researchers will continue using current languages and the corresponding architectures as long as they provide at least some incremental benefit

Outline: Introduction, Portability Problems, Approach, Molecular Dynamics Code, Feature Tracking Code, Conclusions, Future Work

Approach: Code Structure
- Clearly expose the parallel characteristics of the code to facilitate portability and trade-offs across architectures
- Create a consolidated loop structure that maximizes the width of available parallelism
- Organize the code in a modular fashion that clearly delineates the parallel sections
- Remove architecture-specific optimizations

Before (fragmented, architecture-specific):
    …
    for(…){ (independent iter.) }
    dependentTask;
    for(…){ (independent iter.) }
    dependentTask;
    arch-specificCode;
    …

After (consolidated, generic):
    …
    for(…){ (all independent iter.) }
    dependentTask;
    genericCode;
    …
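A minimal concrete illustration of the consolidation step (hypothetical code, not from the applications themselves): two small independent loops over the same index range are fused into one loop whose full iteration space can be handed to a parallel framework.

    /* Illustrative only: before consolidation, parallelism is split
       across two short loops. */
    void update_split(float *pos, float *vel, const float *acc, float dt, int n) {
        for (int i = 0; i < n; i++) vel[i] += acc[i] * dt;
        for (int i = 0; i < n; i++) pos[i] += vel[i] * dt;
    }

    /* After consolidation: one loop exposes the full width of parallelism,
       and each iteration remains independent of the others. */
    void update_fused(float *pos, float *vel, const float *acc, float dt, int n) {
        for (int i = 0; i < n; i++) {
            vel[i] += acc[i] * dt;
            pos[i] += vel[i] * dt;
        }
    }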

Approach: Parallel Framework
- Use a portable, directive-based approach to:
  - Be compatible with current languages
  - Minimize changes to the code
  - Eliminate run-time calls in favor of annotations
- Achieve best-effort parallelization with high-level directives
- Tune performance with optional lower-level directives

Directive-based form:
    …
    dependentTask;
    #pragma_for parallel
    for(…){
        #pragma_for vector
        for(…){ (independent iter.) }
    }
    dependentTask;
    …

Runtime-call form:
    …
    dependentTask;
    deviceAllocate();
    deviceTransfer();
    kernel();
    deviceTransferBack();
    deviceFree();
    dependentTask;
    …
    kernel{ (independent iter.) }

Approach: Portable Implementation
- Demonstrate an approach for writing a single code base with syntactic and performance portability
- Follow the approach of the PGI Accelerator API and create a higher-level generic framework that can be translated to OpenMP and to the PGI Accelerator API

Our generic framework:
    #define MCPU // or GPU
    dependentTask;
    #pragma api for parallel
    for(…){
        #pragma api for vector
        for(…){ (independent iter.) }
    }
    dependentTask;
    …

Source translation to OpenMP:
    dependentTask;
    #pragma omp parallel for
    for(…){ (independent iter.) }
    dependentTask;
    …

Source translation to the PGI Accelerator API:
    dependentTask;
    #pragma acc for parallel
    for(…){
        #pragma acc for vector
        for(…){ (independent iter.) }
    }
    dependentTask;
    …
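Below is a smaller, self-contained instance of this translation scheme, filled in with a concrete vector-scaling loop (a hypothetical example assuming float *y, float a, and int n are in scope; the #pragma api spelling follows the slides, and the translations mirror the mapping shown above).

    /* Our generic framework (hypothetical scaling loop) */
    #pragma api for parallel
    for (int i = 0; i < n; i++)
        y[i] = a * y[i];

    /* Translated to OpenMP */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * y[i];

    /* Translated to the PGI Accelerator API */
    #pragma acc region
    {
        #pragma acc for parallel
        for (int i = 0; i < n; i++)
            y[i] = a * y[i];
    }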

Outline: Introduction, Portability Problems, Approach, Molecular Dynamics Code, Feature Tracking Code, Conclusions, Future Work

Molecular Dynamics Code: Overview
- The particle space is divided into large boxes (or cells, which could be any shape), and each large box is divided into small boxes
- The code computes interactions between the particles in any given small box and those in its neighboring small boxes
- Within a large box, the small boxes and their neighbors are processed in parallel by a multi-core CPU or a GPU
- Large boxes are assigned to different cluster nodes for parallel computation
(Figure: select a neighbor box, select a home particle, then compute its interactions with the neighbor particles; the loop extents are the number of home boxes, home particles, neighbor boxes, and neighbor particles.)
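The loop nest below is a simplified serial skeleton of the computation just described (a sketch assuming a precomputed neighbor list and per-box particle counts; the names and data layout are assumptions, not the authors' actual data structures).

    /* Illustrative serial skeleton only. */
    void interactions(int n_home_boxes, const int *n_neighbors, int **neighbor_index,
                      const int *n_particles /* per box */) {
        for (int i = 0; i < n_home_boxes; i++)                    /* home boxes */
            for (int j = 0; j < n_neighbors[i]; j++) {            /* neighbor boxes (incl. self) */
                int nb = neighbor_index[i][j];
                for (int k = 0; k < n_particles[i]; k++)          /* home particles */
                    for (int l = 0; l < n_particles[nb]; l++) {   /* neighbor particles */
                        /* accumulate the contribution of particle (nb, l)
                           on particle (i, k): force, potential, etc. */
                    }
            }
    }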

Molecular Dynamics Code: Code Translation (our generic framework)
    #define MCPU // or GPU
    #pragma api data copyin(box[0:#_boxes-1]) \
        copyin(pos[0:#_par.-1]) copyin(chr[0:#_par.-1]) \
        copyout(dis[0:#_par.-1])
    {
      #pragma api compute
      {
        #pragma api parallel(30) private(…) cache(home_box)
        for(i=0; i<#_home_boxes; i++){
          Home box setup
          #pragma api sequential
          for(j=0; j<#_neighbor_boxes; j++){
            Neighbor box setup
            #pragma api vector(128) private(…) cache(neighbor_box)
            for(k=0; k<#_home_particles; k++){
              #pragma api sequential
              for(l=0; l<#_neighbor_particles; l++){
                Calculation of interactions
              }
            }
          }
        }
      }
    }

Molecular Dynamics Code: Code Translation (source translation from our generic framework to OpenMP)
    omp_set_num_threads(30);
    #pragma omp parallel for
    for(i=0; i<#_home_boxes; i++){
      Home box setup
      for(j=0; j<#_neighbor_boxes; j++){
        Neighbor box setup
        for(k=0; k<#_home_particles; k++){
          for(l=0; l<#_neighbor_particles; l++){
            Calculation of interactions
          }
        }
      }
    }

Molecular Dynamics Code: Code Translation (source translation from our generic framework to the PGI Accelerator API)
    #pragma acc data region copyin(box[0:#_boxes-1]) \
        copyin(pos[0:#_par.-1]) copyin(chr[0:#_par.-1]) \
        copyout(dis[0:#_par.-1])
    {
      #pragma acc region
      {
        #pragma acc parallel(30) independent private(…) cache(home_box)
        for(i=0; i<#_home_boxes; i++){
          Home box setup
          #pragma acc sequential
          for(j=0; j<#_neighbor_boxes; j++){
            Neighbor box setup
            #pragma acc vector(128) private(…) cache(neighbor_box)
            for(k=0; k<#_home_particles; k++){
              #pragma acc sequential
              for(l=0; l<#_neighbor_particles; l++){
                Calculation of interactions
              }
            }
          }
        }
      }
    }

Molecular Dynamics Code: Workload Mapping
(Timeline figure: within each time step, the boxes 1, 2, 3, … are distributed across processors; each processor's vector units 1, 2, 3 … n compute home box – home box particle interactions, then home box – neighbor box 1, 2, … n particle interactions, with a SYNC after each phase; a global SYNC separates time step 1 from time step 2.)
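A minimal OpenMP sketch of this mapping is shown below (an illustration with assumed loop bounds and placeholder bodies, not the authors' kernel): boxes are distributed across cores, the particle loop is handed to the vector units, and the implicit barrier at the end of each parallel region provides the global synchronization between time steps.

    /* Illustrative mapping sketch only. */
    void md_steps(int n_steps, int n_home_boxes, int n_neighbor_boxes,
                  int n_home_particles) {
        for (int step = 0; step < n_steps; step++) {          /* time steps */
            #pragma omp parallel for                          /* boxes -> cores */
            for (int i = 0; i < n_home_boxes; i++) {
                for (int j = 0; j < n_neighbor_boxes; j++) {  /* one neighbor phase at a time */
                    #pragma omp simd                          /* particles -> vector units */
                    for (int k = 0; k < n_home_particles; k++) {
                        /* interactions of home particle k of box i with
                           the particles of neighbor box j */
                    }
                }
            }
            /* the implicit barrier at the end of the parallel region acts as
               the global SYNC between time steps */
        }
    }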

Molecular Dynamics Code: Performance
- Our implementation achieves over 50% reduction in code length compared to OpenMP+CUDA
- GPU performance of our implementation is within 17% of CUDA
(Results table comparing kernel length [lines], execution time [s], and speedup [x] for the following configurations: original C on a 1-core CPU; structured C on a 1-core CPU; structured OpenMP on a multi-core CPU; GPU-specific CUDA on a GPU; and our portable framework on a 1-core CPU, a multi-core CPU, and a GPU; plus the GPU initialization/transfer overhead under CUDA.)

Outline: Introduction, Portability Problems, Approach, Molecular Dynamics Code, Feature Tracking Code, Conclusions, Future Work

Conclusions
- A common high-level directive-based framework can support efficient execution across architecture types
- Correct use of this framework, with the support of current state-of-the-art parallelizing compilers, can yield performance comparable to custom, low-level code
- Our approach results in increased programmability across architectures and decreased code maintenance cost

Outline: Introduction, Portability Problems, Approach, Molecular Dynamics Code, Feature Tracking Code, Conclusions, Future Work

Future Work
- Test our solution with a larger number of applications
- Propose several annotations (see the sketch below) that allow:
  - Assignment to particular devices
  - Concurrent execution of loops, each limited to a subset of processing elements
- Extend the translator with code analysis and with translation to a lower-level language (CUDA) to implement these features for the GPU
- Source code for the translator will be released shortly at lava.cs.virginia.edu
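As a purely hypothetical illustration of what such annotations might look like (the device and width clauses below are invented spellings for this sketch, not an implemented or proposed syntax):

    /* Hypothetical future annotations: neither clause exists in the current framework. */
    #pragma api for parallel device(gpu:0)          /* assign this loop to a particular device */
    for (int i = 0; i < n; i++) { /* ... */ }

    #pragma api for parallel device(cpu) width(4)   /* run concurrently on a subset of cores */
    for (int j = 0; j < m; j++) { /* ... */ }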

Thank you