Manycore Optimizations: A Compiler and Language Independent ManyCore Runtime System (ROSE Team, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory)

Presentation transcript:

Manycore Optimizations: A Compiler and Language Independent ManyCore Runtime System
ROSE Team, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA. Operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration under Contract DE-AC52-07NA27344.

2  Single-core data layout will be crucial to memory performance
- Independent of distributed-memory data partitioning
- Beyond the scope of control parallelism (OpenMP, Pthreads, etc.)
- How we lay out data affects the performance of how it is used
- New languages and programming models have the opportunity to encapsulate the data layout, but data layout can also be addressed directly
- General-purpose languages provide the mechanisms to tightly bind the implementation to the data layout (providing low-level control over the issues required to get good performance)
- Applications are commonly expressed at a low level that binds the implementation to the data layout (and are encouraged to do so to get good performance)
- Compilers cannot unravel code enough to make the automated global optimizations to data layout that are required

3  Runtime systems can assist data layout optimizations
- Assume the user will permit use of an array abstraction
  - 40 years of history in array languages
  - Currently used in F90
  - Target for many-core
  - BoxLib FAB abstraction
- The motivating goal is to support exascale architectures

4  Exascale architectures will include intensive memory usage and less memory coordination
- A million processors (not relevant for this many-core runtime system)
- A thousand cores per processor
  - 1 Tera-FLOP per processor
  - 0.1 bytes per FLOP (at 1 TFLOP/s that is roughly 100 GB of memory per processor)
  - Memory bandwidth 4 TB/sec to 1 TB/sec
  - We assume NUMA
- Assume no cross-chip cache coherency
  - Or it will be expensive (in performance and power)
  - So assume we don't want to use it…
- Can DOE applications operate with these constraints?

5  We distribute each array into many pieces for many cores…
- Assume a 1-to-1 mapping of pieces of the array to cores
  - Could be many-to-one to support latency hiding…
- Zero false sharing, so no cache coherency requirements
[Figure: a single array abstraction split into per-core array sections (Core 0 … Core 3), mapping logical array positions to physical array positions distributed over the cores]
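To make the mapping concrete, here is a minimal self-contained sketch (not the MulticoreArray implementation; the type and member names are hypothetical) of a 1D block distribution that maps a logical index to an owning core and a local offset within that core's section:

#include <cassert>
#include <cstddef>

// Hypothetical sketch of a block distribution of N elements over P cores.
// Each core owns a contiguous section; the last core absorbs the remainder.
struct BlockMapping
{
    std::size_t N;   // global number of elements
    std::size_t P;   // number of cores (array sections)

    // Number of elements owned by a given core.
    std::size_t sectionSize(std::size_t core) const
    {
        std::size_t base = N / P;
        return (core == P - 1) ? base + N % P : base;
    }

    // Map a logical index to (core, local offset within that core's section).
    void map(std::size_t i, std::size_t &core, std::size_t &local) const
    {
        assert(P > 0 && N >= P && i < N);
        std::size_t base = N / P;
        core  = (i / base < P) ? i / base : P - 1;   // clamp the remainder into the last core
        local = i - core * base;
    }
};

Because each core's section would be a separate allocation (and, in the runtime described later, a NUMA-local one), no two cores ever write into the same section, which is what eliminates false sharing.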

6  There are important constraints, just to make this clearer…
- Only handle stencil operations
  - No reductions…
  - No indirect addressing…
- Assume the machine has low-level support for synchronization
- Regular structured-grid operations…
- Support for irregular computation would be handled via either Pat's Liszt abstraction (Stanford) or Keshav's Galois runtime system (University of Texas)

7  Many scientific data operations are applied to block-structured geometries
- Supports multi-dimensional array data
- Cores can be configured into logical hypercube topologies
  - Currently multi-dimensional periodic arrays of cores (core arrays)
  - Operations on data on cores can be tiled for better cache performance
- The constructor takes the multi-dimensional array size and the target multi-dimensional core array size (see the sketch below)
- Supports table-based and algorithm-based distributions
[Figure: multi-dimensional data mapped onto a simple 3D core array; core arrays on 1K cores could be 10x10x10]
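For concreteness, a hedged sketch of what such a constructor call might look like at the call site. The MulticoreArray constructor signature is not shown on these slides, so the argument order and names below are assumptions for illustration only:

// Hypothetical usage: a 512 x 512 x 512 array of floats distributed over a
// 10 x 10 x 10 core array (the "1K cores" configuration mentioned above).
// Argument order and parameter names are assumed.
MulticoreArray<float> temperature( /* array size:      */ 512, 512, 512,
                                   /* core array size: */  10,  10,  10 );

// A table-based distribution might instead pass an explicit per-dimension table
// of section sizes, while an algorithm-based distribution computes them internally.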

8  A high-level interface for block-structured operations enhances performance and debugging across cores
- This is a high-level interface that permits debugging
- Indexing provides an abstraction over the complexity of data that is distributed over many cores: the indexing hides the distribution of data over many cores

template <typename T>
void relax2D_highlevel( MulticoreArray<T> & array, MulticoreArray<T> & old_array )
   {
  // This is a working example of a 3D (6-point) stencil demonstrating a high-level interface
  // suitable only as debugging support.
     #pragma omp parallel for
     for (int k = 1; k < array.get_arraySize(2)-1; k++)
        {
          for (int j = 1; j < array.get_arraySize(1)-1; j++)
             {
               for (int i = 1; i < array.get_arraySize(0)-1; i++)
                  {
                 // Global (i,j,k) indexing: operator() maps the logical index to the owning core's
                 // array section, so the data distribution is invisible to this code.
                    array(i,j,k) = ( old_array(i-1,j,k) + old_array(i+1,j,k) +
                                     old_array(i,j-1,k) + old_array(i,j+1,k) +
                                     old_array(i,j,k+1) + old_array(i,j,k-1) ) / 6.0;
                  }
             }
        }
   }
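A sketch of how this debugging-level interface might be driven from a time-stepping loop. The driver is not on the slides, so the constructor arguments and the swap step are assumptions about how the two arrays would be cycled:

// Hypothetical driver: apply the relaxation repeatedly, reading old_array and writing
// array, then exchanging the roles of the two arrays each iteration.
MulticoreArray<float> array    (nx, ny, nz, coresX, coresY, coresZ);   // assumed constructor
MulticoreArray<float> old_array(nx, ny, nz, coresX, coresY, coresZ);

for (int step = 0; step < numberOfSteps; step++)
   {
     relax2D_highlevel(array, old_array);
     std::swap(array, old_array);   // assumes MulticoreArray is swappable (or cheap to swap)
   }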

9  Mid-level interface as a target for compiler-generated code, or maybe also user code (unclear if this is a good user target)
- The mid-level interface is simple, but not as high performance as the low-level interface (next slide)…

template <typename T>
void relax2D_highlevel( MulticoreArray<T> & array, MulticoreArray<T> & old_array )
   {
  // This is a working example of the relaxation associated with a stencil on the array abstraction,
  // mapped to the separate multi-dimensional memories allocated per core and onto a multi-dimensional
  // array of cores (core array).
     int numberOfCores_X = array.get_coreArraySize(0);
     int numberOfCores_Y = array.get_coreArraySize(1);

  // Use OpenMP to support the threading (control parallelism)...
     #pragma omp parallel for
     for (int core_X = 0; core_X < numberOfCores_X; core_X++)
        {
          for (int core_Y = 0; core_Y < numberOfCores_Y; core_Y++)
             {
            // Construct the core data structure reference (lifts loop-invariant portions of the code).
               Core<T> & coreMemory = array.getCore(core_X,core_Y,0);

            // Lift out loop-invariant local array size values (accesses to core indexing data
            // go through the core data structure reference).
               int sizeX = coreMemory.coreArrayNeighborhoodSizes_2D[1][1][0];
               int sizeY = coreMemory.coreArrayNeighborhoodSizes_2D[1][1][1];

            // Cores on the boundary of the core array skip the physical (global) boundary.
               int base_X  = (coreMemory.bounaryCore_2D[0][0] == true) ? 1 : 0;
               int bound_X = (coreMemory.bounaryCore_2D[0][1] == true) ? sizeX - 2 : sizeX - 1;
               int base_Y  = (coreMemory.bounaryCore_2D[1][0] == true) ? 1 : 0;
               int bound_Y = (coreMemory.bounaryCore_2D[1][1] == true) ? sizeY - 2 : sizeY - 1;

               for (int j = base_Y; j <= bound_Y; j++)
                  {
                    for (int i = base_X; i <= bound_X; i++)
                       {
                      // Compiler-generated code based on the user's application. Array element index
                      // references outside the current indexed core generate references to the adjacent core.
                         array.getCore(core_X,core_Y,0)(i,j,0) =
                              ( old_array.getCore(core_X,core_Y,0)(i-1,j,0) +
                                old_array.getCore(core_X,core_Y,0)(i+1,j,0) +
                                old_array.getCore(core_X,core_Y,0)(i,j-1,0) +
                                old_array.getCore(core_X,core_Y,0)(i,j+1,0) ) / 4.0;
                       }
                  }
             }
        }
   }

Note: the indexing could alternatively use loop-invariant references; explicit core indexing is shown here to make the per-core structure visible.

10  Low-level code for a stencil on data distributed over many cores (to be compiler-generated, high-performance code)

template <typename T>
void relax2D( MulticoreArray<T> & array, MulticoreArray<T> & old_array )
   {
  // This is a working example of the relaxation associated with a stencil on the array abstraction,
  // mapped to the separate multi-dimensional memories allocated per core and onto a multi-dimensional
  // array of cores (core array).
     int numberOfCores = array.get_numberOfCores();

  // Macro to support linearization of the multi-dimensional 2D array index computation
     #define local_index2D(i,j) (((j)*sizeX)+(i))

  // Use OpenMP to support the threading; loop over all cores (linearized core array).
     #pragma omp parallel for
     for (int core = 0; core < numberOfCores; core++)
        {
       // This lifts out loop-invariant portions of the code.
          T* arraySection     = array.get_arraySectionPointers()[core];
          T* old_arraySection = old_array.get_arraySectionPointers()[core];

       // Lift out loop-invariant local array size values.
          int sizeX = array.get_coreArray()[core]->coreArrayNeighborhoodSizes_2D[1][1][0];
          int sizeY = array.get_coreArray()[core]->coreArrayNeighborhoodSizes_2D[1][1][1];

          for (int j = 1; j < sizeY-1; j++)
             {
               for (int i = 1; i < sizeX-1; i++)
                  {
                 // This is the dominant computation for each array section per core. The stencil (or any
                 // other local code) is derived by the compiler from the user's code and placed here.
                    arraySection[local_index2D(i,j)] =
                         ( old_arraySection[local_index2D(i-1,j)] + old_arraySection[local_index2D(i+1,j)] +
                           old_arraySection[local_index2D(i,j-1)] + old_arraySection[local_index2D(i,j+1)] ) / 4.0;
                  }
             }

       // We could alternatively generate the call for relaxation on the internal boundaries in the same loop.
          array.get_coreArray()[core]->relax_on_boundary(core,array,old_array);
        }

  // Undefine the local 2D index support macro
     #undef local_index2D
   }

11  Call to low-level, compiler-generated code to support internal boundary relaxation on the edges of each core
- The relaxation (stencil) operator is applied on the boundary of the memory allocated to each core
- Relies on shared-memory support on the processor
- Relaxation code for internal core boundaries is complex
  - Lots of cases for faces, edges, and corners
  - More complex for higher-dimensional data
- Current work supports 1D and 2D relaxation on internal core boundaries

template <typename T>
void relax2D_on_boundary( MulticoreArray<T> & array, MulticoreArray<T> & old_array )
   {
  // This function supports the relaxation operator on the internal boundaries of the different
  // arrays allocated on a per-core basis. We take advantage of shared memory to support the
  // stencil operations.
     int numberOfCores = array.get_numberOfCores();

     #pragma omp parallel for
     for (int core = 0; core < numberOfCores; core++)
        {
       // Relaxation on the edges of the specific core (too large to show on the slide)…
          array.get_coreArray()[core]->relax_on_boundary(core,array,old_array);
        }
   }

12  Indexing for the boundaries of a core (stencil on core edges)
- The example shows generated code for the stencil on core edges
- No ghost boundaries are required… but they could be used (not implemented yet)
- Array element "[Y-1][X]" is a reference to an element in a different core's memory
- This approach avoids ghost boundaries
- But there are a lot of cases for each side of a multi-dimensional array
  - 1D: 2 vertices
  - 2D: 4 edges and 4 vertices
  - 3D: 6 faces, 12 edges, and 8 vertices
  - 4D: more of each…
- 2D example code fragment of upper-edge relaxation on a specific core

// Upper edge
// ***** | ****** | *****
//
// ***** | *XXXX* | *****
// ***** | ****** | *****
//
// ***** | ****** | *****
for (int i = 1; i < coreArrayNeighborhoodSizes_2D[1][1][0]-1; i++)
   {
     arraySection[index2D(i,0)] =
          ( /* array[Y-1][X]: data reference on the upper (adjacent) core */
            old_arraySectionPointers[coreArrayNeighborhoodLinearized_2D[0][1]][index2D(i,coreArrayNeighborhoodSizes_2D[0][1][1]-1)] +
            /* array[Y+1][X]: data reference on the current core */
            old_arraySection[index2D(i,1)] +
            /* array[Y][X-1] */ old_arraySection[index2D(i-1,0)] +
            /* array[Y][X+1] */ old_arraySection[index2D(i+1,0)] ) / 4.0;
   }
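The code above indexes the upper neighbor through coreArrayNeighborhoodLinearized_2D, whose construction is not shown on the slides. As a self-contained illustration of one plausible construction, assuming a periodic (torus) core array and row-major linearization of cores (the actual MulticoreArray implementation may differ):

#include <cstdio>

// Hypothetical sketch: for core (cx, cy) in a periodic coresX x coresY core array, compute the
// linearized core index of each of the 3x3 neighboring cores. neighborhood[dy][dx] follows the
// slide's [Y][X] convention, with [1][1] being the core itself.
void buildNeighborhood2D(int cx, int cy, int coresX, int coresY, int neighborhood[3][3])
{
    for (int dy = -1; dy <= 1; dy++)
    {
        for (int dx = -1; dx <= 1; dx++)
        {
            int nx = (cx + dx + coresX) % coresX;               // periodic wrap in X
            int ny = (cy + dy + coresY) % coresY;               // periodic wrap in Y
            neighborhood[dy + 1][dx + 1] = ny * coresX + nx;    // row-major core linearization
        }
    }
}

int main()
{
    int neighborhood[3][3];
    buildNeighborhood2D(0, 0, 4, 4, neighborhood);   // a corner core of a 4x4 core array
    std::printf("upper neighbor of core (0,0) is core %d\n", neighborhood[0][1]);
    return 0;
}

With such a table, the [0][1] entry used in the slide's code fragment is exactly the core holding the row of data directly above the current core's section.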

13  We use libnuma to allocate the separate memory for each core closest to that core for the best possible performance
- NUMA-based allocation of the array subsection for each core (using the memory closest to each core)

template <typename T>
void MulticoreArray<T>::allocateMemorySectionsPerCore()
   {
  // This is the memory allocation support for each core, allocating memory that is as close as possible
  // to that core within the NUMA processor architecture (requires libnuma for the best portable allocation
  // of the closest memory to each core). OpenMP provides the control parallelism.
     #pragma omp parallel for
     for (int core = 0; core < numberOfCores; core++)
        {
          int size = memorySectionSize(core);
#if HAVE_NUMA_H
       // libnuma-specific code: allocate memory local to the associated core.
          arraySectionPointers[core] = (T*) numa_alloc_local((size_t)(size*sizeof(T)));

       // Interestingly, libnuma will return a NULL pointer if asked to allocate zero bytes
       // (but we want the semantics to be consistent with C++ allocation).
          if (size == 0 && arraySectionPointers[core] == NULL)
             {
               arraySectionPointers[core] = new T[size];
               assert(arraySectionPointers[core] != NULL);
             }
#else
       // Non-libnuma fallback.
          arraySectionPointers[core] = new T[size];
#endif
          assert(arraySectionPointers[core] != NULL);

       // Initialize the memory section pointer stored in the Core (update the Core in the array of cores).
          assert(coreArray[core] != NULL);
          coreArray[core]->arraySectionPointer = arraySectionPointers[core];
          assert(coreArray[core]->arraySectionPointer != NULL);
        }
   }
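Memory obtained from numa_alloc_local has to be released with numa_free, the libnuma counterpart, rather than delete[]. The slides do not show the deallocation path, so the following is a hedged sketch of what a matching routine might look like, reusing the member names from the allocation code above:

// Hypothetical sketch of the matching deallocation. numa_free() is the real libnuma call
// paired with numa_alloc_local(); the new[] fallback must be released with delete[].
template <typename T>
void MulticoreArray<T>::deallocateMemorySectionsPerCore()
   {
     for (int core = 0; core < numberOfCores; core++)
        {
          int size = memorySectionSize(core);
#if HAVE_NUMA_H
          if (size > 0)
               numa_free(arraySectionPointers[core], (size_t)(size*sizeof(T)));
          else
               delete [] arraySectionPointers[core];   // the zero-size case above used new[]
#else
          delete [] arraySectionPointers[core];
#endif
          arraySectionPointers[core] = NULL;
          coreArray[core]->arraySectionPointer = NULL;
        }
   }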

14  Fortran example for a 2D stencil operation using halos
- The example shows the halo exchange, so that all halo memory is synchronized and individual cores can begin computation on their tile
- Halos are required by the runtime, and the use of halos actually simplifies code for users
  - Otherwise, array element "[Y-1][X]" is a reference to an element in a different core's memory
- I don't think this is a problem; it looks like coarrays, but when is the memory transferred?

/* synchronize and transfer memory between cores and GPUs */
/* memory for cores and GPU buffers allocated previously */
exchange_halo(Array);

/* user code */
/* I'm assuming this is "compiler generated" code */
for (int i = 1; i < coreArrayNeighborhoodSizes_2D[1][1][0]-1; i++)
   {
  /* call the OpenCL runtime to run the kernel on each GPU */
  /* GPU memory (and kernel arguments) set up previously by the compiler */
     clEnqueueNDRangeKernel(…, kernel, 2 /*numDims*/, global_work_offset,
                            global_work_size, local_work_size, …);
   }

/* skeleton for the GPU kernel */
__kernel void relax_2D( __global float * Array, __global float * oldArray, __local float * tile )
   {
  /* fill the local "cache" (tile) with oldArray plus its halo */
     copy_to_local(tile, oldArray);

  /* array offsets are macros based on the tile/local cache size */
     Array[CENTER] = ( tile[LEFT] + tile[RIGHT] + tile[DOWN] + tile[UP] ) / 4.0f;
   }
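The slide leaves exchange_halo(Array) abstract. As a rough illustration of the intended semantics only (not the runtime's actual implementation), here is a minimal self-contained sketch of a 1D halo exchange between adjacent per-core sections in shared memory, assuming each section stores one ghost cell on each side of its interior cells:

#include <vector>
#include <cstddef>

// Hypothetical sketch: each "core" owns a section of interior cells plus two ghost cells
// (index 0 and size-1). exchange_halo copies each neighbor's outermost interior cell into
// this section's ghost cell, so the stencil over the interior never indexes another section.
void exchange_halo(std::vector<std::vector<float> > &sections)
{
    const std::size_t P = sections.size();
    for (std::size_t core = 0; core < P; core++)
    {
        const std::size_t n = sections[core].size() - 2;   // interior cells in this section
        if (core > 0)            // left ghost <- left neighbor's last interior cell
            sections[core][0] = sections[core - 1][sections[core - 1].size() - 2];
        if (core + 1 < P)        // right ghost <- right neighbor's first interior cell
            sections[core][n + 1] = sections[core + 1][1];
    }
}

Only neighbors' interior cells are read, so the in-place sequential loop is correct; a real runtime would do this per dimension (and between host and GPU buffers, as on the slide) with appropriate synchronization before the kernels are enqueued.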