Experiences with Sweep3D Implementations in Co-array Fortran


Experiences with Sweep3D Implementations in Co-array Fortran Cristian Coarfa Yuri Dotsenko John Mellor-Crummey Department of Computer Science Rice University Houston, TX USA Good afternoon everyone. My name is Cristian Coarfa and today I’m going to talk about our experiences with Sweep3D implementations in Co-array Fortran. This is joint work with Yuri Dotsenko and John Mellor-Crummey.

Parallel Programming Models Motivation MPI: de facto standard difficult to program OpenMP: inefficient to map on distributed memory platforms lack of locality control HPF: hard to obtain high-performance heroic compilers needed! An appealing middle ground: global address space languages: CAF, Titanium, UPC The increasing size of current parallel systems requires programming models that enhance developer productivity without compromising performance; we would like such models to work well with a broad range of applications and systems. High-level programming models already exist, but they have several drawbacks. With HPF it is not easy to obtain high performance, requiring heroic compiler effort. OpenMP doesn’t lend itself to efficient implementations on cluster platforms. This led to MPI becoming the de facto standard for parallel programming; MPI offers portability, but the developer is solely responsible for choreographing the computation and communication to achieve high performance; MPI is also difficult to program, and not amenable to compiler-based optimizations. An appealing middle ground is represented by the family of global address space languages such as Co-Array Fortran, Unified Parallel C and Titanium. In this talk we will focus on Co-Array Fortran. We evaluate CAF for an application with sophisticated parallelization: Sweep3D

Co-Array Fortran Global address space programming model one-sided communication (GET/PUT) Programmer has control over performance-critical factors data distribution computation partitioning communication placement Data movement and synchronization as language primitives amenable to compiler-based communication optimization Co-array Fortran, abbreviated CAF, is a global address space programming model and uses one-sided communication through PUT and GET operations. The developer controls performance-critical factors such as data distribution, computation partitioning and communication placement. Data movement and synchronization are expressed at language level as primitives, making CAF amenable to compiler optimization of communication

CAF Programming Model Features SPMD process images fixed number of images during execution images operate asynchronously Both private and shared data real x(20, 20) a private 20x20 array in each image real y(20, 20)[*] a shared 20x20 array in each image Simple one-sided shared-memory communication x(:,j:j+2) = y(:,p:p+2)[r] copy columns from image r into local columns Synchronization intrinsic functions sync_all – a barrier and a memory fence sync_mem – a memory fence sync_team([team members to notify], [team members to wait for]) Pointers and (perhaps asymmetric) dynamic allocation CAF is an SPMD programming model. It enables the programmer to specify both private and shared data; one uses the bracket notation to declare a co-array. To access remote data, the bracket notation is used again to specify a remote image number. CAF offers language-level synchronization such as barriers, memory fences and team synchronization. Next I will give you a visual presentation of CAF communication
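
A minimal sketch (my own, not from the slides) tying these declarations together. It follows the original CAF spelling used in this talk (call sync_all() rather than the Fortran 2008 sync all statement); the image index r and the column indices p and j are illustrative choices:

    program caf_features_demo
      implicit none
      real    :: x(20,20)        ! private: every image has its own x
      real    :: y(20,20)[*]     ! co-array: a shared 20x20 array on every image
      integer :: r, p, j

      r = 1                      ! image to read from (illustrative)
      p = 1                      ! first remote column to fetch (illustrative)
      j = 1                      ! first local column to overwrite (illustrative)

      y = real(this_image())     ! each image fills its local part of the co-array
      call sync_all()            ! barrier + memory fence: remote data is now ready

      if (this_image() /= r) then
        x(:, j:j+2) = y(:, p:p+2)[r]   ! one-sided GET of three columns from image r
      end if
      call sync_all()
    end program caf_features_demo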

One-sided Communication with Co-Arrays integer a(10,20)[*] if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1] (diagram: one a(10,20) per image, image 1 through image N; each image copies from its left neighbor) I would like to present a visual representation of co-arrays and co-array communication. The bracket at the end of the declaration of a means that a is a co-array; each image has a shared 10x20 array, and the collection of all these shared arrays represents the co-array a. The communication model is one-sided, and both the source and the destination are explicit. In this example every image except the first copies data from its left neighbour; this_image() is a CAF intrinsic that returns the index of the current process image
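
A complete, runnable version of the fragment on this slide (my own sketch; the initialization and the final barrier are added for illustration):

    program left_neighbor_shift
      implicit none
      integer :: a(10,20)[*]

      a = this_image()            ! fill the local co-array with the image index
      call sync_all()             ! every image has written its data before anyone copies

      if (this_image() > 1) then
        ! one-sided GET: the two rightmost columns of the left neighbour
        ! become this image's two leftmost columns
        a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
      end if
      call sync_all()
    end program left_neighbor_shift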

Outline CAF programming model cafc Sweep3D implementations in CAF Experimental evaluation Conclusions Next I’ll talk about the Co-array Fortran Compiler developed at Rice University.

Rice Co-Array Fortran Compiler (cafc) First CAF multi-platform compiler previous compiler only for Cray shared memory systems Implements core of the language currently lacks support for derived-type and dynamic co-arrays Core sufficient for non-trivial codes Performance comparable to that of hand-tuned MPI codes Open source CAF was previously implemented only on Cray shared-memory systems. For a parallel programming model to be attractive, it needs a portable implementation. At Rice University we developed the first multi-platform CAF compiler, called cafc. cafc implements the core of CAF, enough to support non-trivial codes. CAF codes compiled with cafc have achieved performance similar to that of hand-tuned MPI codes. Our compiler is open source and available for download on the web

cafc Implementation Strategy Goals portability high-performance on a wide range of platforms Source-to-source compilation of CAF codes uses Open64/SL Fortran 90 infrastructure CAF → Fortran 90 + communication operations Communication ARMCI library for one-sided communication on clusters (PNNL) load/store communication on shared-memory platforms For portability, cafc performs a source-to-source compilation of CAF codes. We use the Open64/SL Fortran 90 infrastructure; CAF codes are translated into Fortran 90 codes plus communication operations. For communication we use the portable communication library ARMCI, developed at PNNL, on clusters, and load/store communication on shared-memory platforms

Synchronization Original CAF specification: team synchronization only sync_all, sync_team Limits performance on loosely-coupled architectures Point-to-point extensions sync_notify(q) sync_wait(p) Point-to-point synchronization semantics: delivery of a notify to q from p implies that all communication from p to q issued before the notify has been delivered to q The original CAF synchronization model contains only team synchronization primitives, such as sync_all and sync_team. These might limit the performance of CAF codes on loosely coupled architectures. We proposed to extend the CAF communication model with the point-to-point primitives sync_notify and sync_wait. The semantics of the point-to-point synchronization primitives is that on the delivery of a notify to q from p, all communication from p to q issued before the notify has been delivered to q
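
A sketch of the resulting handshake, using the cafc point-to-point extensions described above; the image numbers, buffer size and the value written are illustrative, and the code assumes a compiler that provides sync_notify/sync_wait:

    program notify_wait_demo
      implicit none
      real    :: buf(1000)[*]
      integer :: p, q

      p = 1
      q = 2
      if (num_images() < 2) stop 'run with at least two images'
      buf = 0.0
      call sync_all()

      if (this_image() == p) then
        call sync_wait(q)        ! wait until q says its buffer may be overwritten
        buf(:)[q] = 42.0         ! one-sided PUT into image q's buffer
        call sync_notify(q)      ! by the semantics above, the PUT arrives before this notify
      else if (this_image() == q) then
        call sync_notify(p)      ! tell p that buf is free to be written
        call sync_wait(p)        ! returns only after p's PUT has been delivered
        print *, 'image', q, 'received', buf(1)
      end if
      call sync_all()
    end program notify_wait_demo

The destination's initial notify tells the source that the buffer may be overwritten; the source's notify after the PUT guarantees that the data has arrived by the time the destination's sync_wait returns.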

CAF Compiler Targets (Oct 2004) Processors Pentium, Alpha, Itanium2, MIPS Interconnects Quadrics, Myrinet, Gigabit Ethernet, shared memory Operating systems Linux, Tru64, IRIX At the moment, cafc runs on and generates code for a wide range of architectures

Outline CAF programming model cafc Sweep3D implementations Original MPI implementation CAF versions Communication microbenchmark Experimental evaluation Conclusions Next I will present several CAF implementations of Sweep3D

Sweep3D Core of an ASCI application Solves a one-group time-independent discrete ordinates (Sn) 3D Cartesian (XYZ) geometry neutron transport problem Deterministic particle transport accounts for 50-80% execution time of many realistic DOE simulations Sweep3D represents the core of an ASCI application. It solves a one-group, time-independent, discrete ordinates, 3D Cartesian geometry neutron transport problem. It is important to implement Sweep3D efficiently because deterministic particle transport accounts for 50 to 80 percent of the execution time of many realistic DOE simulations. Our starting point is the original MPI version of Sweep3D

Sweep3D Parallelization The parallel version of Sweep3D uses a 2D spatial domain decomposition onto a 2D processor array and employs wavefront parallelism. I will show a visual representation of the computation/communication pattern of Sweep3D. 2D spatial domain decomposition onto a 2D processor array

Sweep3D Parallelization The first processor computes on its data, Wavefront parallelism

Sweep3D Parallelization Then it sends information to its west and south neighbors. Wavefront parallelism

Sweep3D Parallelization After which the first processor goes to the second iteration of its computation, while its west and south neighbors perform their first iteration. Wavefront parallelism

Sweep3D Parallelization In the next step all three processors send data to their west and south neighbors, Wavefront parallelism

Sweep3D Parallelization The wavefront advances; all processors on or before the wavefront are active. Wavefront parallelism

Sweep3D Parallelization Again everybody communicates, Wavefront parallelism

Sweep3D Parallelization Let’s assume that the processors only perform three iterations, so the top left processor is done Wavefront parallelism

Sweep3D Parallelization The process continues until all the processors have finished the computation Wavefront parallelism
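
A tiny worked example (mine, not from the slides) of the wavefront schedule: on a PxQ processor array with the sweep starting in the top-left corner, processor (i,j) first becomes active at step i+j-1.

    program wavefront_steps
      implicit none
      integer, parameter :: P = 3, Q = 3    ! illustrative 3x3 processor array
      integer :: i, j

      do i = 1, P
        do j = 1, Q
          ! processor (i,j) joins the wavefront at step i+j-1
          print '(3(a,i0))', 'processor (', i, ',', j, ') active from step ', i + j - 1
        end do
      end do
    end program wavefront_steps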

Sweep3D Kernel Pseudocode

  do iq = 1, 8
    do mo = 1, mmo
      do kk = 1, kb
        recv e/w into Phiib
        recv n/s into Phijb
        ... ! heavy computation with use/update of Phiib and Phijb
        send e/w Phiib
        send n/s Phijb
      enddo
    enddo
  enddo

Next I will show the pseudocode of the Sweep3D kernel. The outer loops control the angle and the granularity of the pipeline.

Sweep3D Kernel Pseudocode (code as above) Processors receive data from their east/west and north/south neighbors, if necessary

Sweep3D Kernel Pseudocode (code as above) They perform the heavy computation that uses and updates Phiib and Phijb

Sweep3D Kernel Pseudocode (code as above) And then send data to their successors, if necessary

Initial Sweep3D CAF Implementation Based on the MPI implementation Maintain original computation Convert communication buffers into co-arrays Fundamental issue: converting from two-sided communication into one-sided communication Next I’m going to talk about our first CAF implementation. We derived it from the MPI implementation available on the web. A crucial issue was converting from two-sided to one-sided communication.

2-sided vs 1-sided Communication Let’s examine two-sided and one-sided communication in more detail. The thread on the left is the sender and the thread on the right is the receiver. 2-sided comm

2-sided vs 1-sided Communication MPI_Send MPI_Recv In the MPI version, the sender calls MPI_Send, and the receiver performs a call to MPI_Recv. 2-sided comm

2-sided vs 1-sided Communication MPI_Send MPI_Recv There are two important points to note about the MPI communication: there is an implicit synchronization between sender and receiver, and the MPI library manages communication buffers automatically. 2-sided comm

2-sided vs 1-sided Communication MPI_Send MPI_Recv In CAF, communication buffer management and synchronization are exposed at the language level. The thread on the left is the source and the thread on the right is the destination. 2-sided comm 1-sided comm

2-sided vs 1-sided Communication sync_notify sync_wait MPI_Send MPI_Recv In the general case, the source needs to receive a notification that the buffer on the destination can be written into, to avoid data races. 2-sided comm 1-sided comm

2-sided vs 1-sided Communication sync_notify sync_wait PUT MPI_Send MPI_Recv Then it performs a put 2-sided comm 1-sided comm

2-sided vs 1-sided Communication sync_notify sync_wait PUT MPI_Send MPI_Recv sync_notify Followed by a notify. 2-sided comm 1-sided comm

2-sided vs 1-sided Communication sync_notify sync_wait PUT MPI_Send MPI_Recv sync_notify The destination consumes the notify with a sync_wait call, at which point it knows that the communication event completed, and both source and destination can advance their computation sync_wait 2-sided comm 1-sided comm

CAF Implementation Issues Synchronization, necessary to avoid data races, might lead to inefficiency. Using multiple communication buffers enables overlap of synchronization with computation

One- vs. Two-buffer Communication (timeline diagram, source and destination) One-buffer communication: pipeline bubbles appear. Two-buffer communication: the notify arrives before the source calls sync_wait, so there are virtually no bubbles!

Asynchrony-tolerant CAF Implementation of Sweep3D Multiple-versioned communication buffers Benefits Overlap PUT with computation on destination Overlap of synchronization with computation on source There are several ways to make CAF communication more efficient: e.g., less synchronization (using the transitive properties of point-to-point synchronization) or more buffer space (one buffer per communication event). We designed an asynchrony-tolerant implementation of Sweep3D using multiple-versioned communication buffers, which enable the overlap of a PUT by the source process image with computation on the destination process image. In the case of a cluster with support for non-blocking communication and synchronization, one could use three buffers: in the stationary state one buffer version is written into asynchronously by the predecessor, one version is used for computation by the current process image, and the last buffer version is used for a non-blocking PUT to the successor. In practice, we have implemented the multi-versioned buffers as an array of buffers. Next I will show the benefits of having multi-version buffers. Stationary state for a three-versioned communication buffer: one buffer version written into asynchronously by predecessor one buffer version computed on by current process image one buffer version used for a non-blocking put to successor

Three-buffer Communication (diagram: each of the three buffer versions cycles through receiving from the predecessor, being computed on, and being sent to the successor)
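
A runnable sketch (mine, not the actual Sweep3D code) of a one-dimensional pipeline with multi-versioned communication buffers, written with the cafc notify/wait extensions. It assumes that notifies between a pair of images are counted and matched in order, it uses blocking PUTs for simplicity (the implementation in the talk also relies on non-blocking PUTs and notifies), and the buffer size, number of versions and step count are illustrative:

    program multi_version_pipeline
      implicit none
      integer, parameter :: NSTEPS = 6        ! pipeline steps (illustrative)
      integer, parameter :: NV = 3            ! number of buffer versions
      integer, parameter :: N = 1000          ! message size (illustrative)
      real    :: buf(N, NV)[*]                ! NV versions of the communication buffer
      real    :: acc(N)
      integer :: me, np, step, iv, v

      me  = this_image()
      np  = num_images()
      acc = 0.0
      buf = 0.0
      call sync_all()

      ! start-up: grant the predecessor one "credit" per buffer version
      if (me > 1) then
        do v = 1, NV
          call sync_notify(me - 1)
        end do
      end if

      do step = 1, NSTEPS
        iv = mod(step - 1, NV) + 1             ! buffer version used in this step
        if (me > 1) call sync_wait(me - 1)     ! this step's data has arrived in buf(:,iv)
        acc = acc + buf(:, iv) + 1.0           ! stand-in for the heavy wavefront computation
        if (me > 1) call sync_notify(me - 1)   ! version iv may now be overwritten again
        if (me < np) then
          call sync_wait(me + 1)               ! successor has a free copy of version iv
          buf(:, iv)[me + 1] = acc             ! one-sided PUT into the successor
          call sync_notify(me + 1)             ! tell the successor its data is in place
        end if
      end do
      call sync_all()
    end program multi_version_pipeline

The NV start-up notifies act as buffer credits: an image PUTs into version iv of its successor only after the successor has finished computing with the data that previously occupied that version, so the PUT overlaps with the successor's computation on a different version.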

Communication Throughput Microbenchmark MPI implementation: blocking send and receive CAF one-version buffer CAF multi-versioned buffers ARMCI implementation: one buffer

Outline CAF programming model cafc Sweep3D implementations Experimental evaluation Conclusions

Experimental Evaluation Platforms Itanium2+Quadrics QSNet II (Elan4) SGI Altix 3000 Itanium2+Myrinet 2000 Alpha+Quadrics QSNet (Elan3) Problem sizes 50x50x50 150x150x150 300x300x300

Itanium2 + Quadrics, Size 50x50x50

Itanium2 + Quadrics, Size 150x150x150

Itanium2 + Quadrics, Size 300x300x300 multi-version buffers improve performance of CAF codes by 15%; it is imperative to use non-blocking notifies

Itanium2+Quadrics, Communication Throughput Microbenchmark multi-version buffers improve throughput by 30% for messages up to 8KB and by 10% for messages larger than 8KB; the overhead of the CAF translation is acceptable

SGI Altix 3000, Size 50x50x50

SGI Altix 3000, Size 150x150x150 multi-version buffers are effective for asynchrony-tolerance

SGI Altix 3000, Size 300x300x300 both CAF implementations outperform MPI

SGI Altix 3000, Communication Throughput Microbenchmark Warm cache The ARMCI library effectively exploits the hardware support for efficient data movement by calling a system-tuned memcpy; MPI performs extra memory copies

Summary of results MPI buffering for small messages helps latency & asynchrony tolerance CAF multi-version buffers improve performance of one-sided communication for wavefront computations enables PUT and receiver’s computation to overlap asynchrony tolerance between sender and receiver Non-blocking notifies are important for performance enables synchronization to overlap with computation Platform results CAF outperforms MPI for large problem sizes by ~10% on Itanium2+{Quadrics,Myrinet,Altix} CAF ~16% slower on Alpha+Quadrics (Elan3) ARMCI lacks non-blocking notifies on Elan3

Enhancing CAF Usability CAF vs MPI usability easier to use than MPI for simple parallel programs as difficult as MPI for carefully-tuned parallel codes Improving CAF ease of use compiler support for managing multi-version communication buffers vectorizing fine-grain communication to best support the X1 and cluster platforms with the same code http://www.hipersoft.rice.edu/caf

Implementing Communication x(1:n) = a(1:n)[p] + … Use a temporary buffer to hold off-processor data allocate buffer perform GET to fill buffer perform computation: x(1:n) = buffer(1:n) + … deallocate buffer Optimizations no temporary storage for co-array to co-array copies load/store communication on shared-memory systems
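
A sketch of the temporary-buffer strategy just described, written in CAF itself for readability; the actual cafc output is Fortran 90 that calls the runtime (and ARMCI on clusters) for the GET, and the array names, n and p are illustrative:

    program get_with_temporary
      implicit none
      integer, parameter :: n = 100
      real    :: a(n)[*], x(n), y(n)
      real, allocatable :: buffer(:)
      integer :: p

      a = real(this_image())
      y = 1.0
      x = 0.0
      call sync_all()

      p = 1                            ! image that owns the remote data (illustrative)
      ! translation of:  x(1:n) = a(1:n)[p] + y(1:n)
      allocate(buffer(n))              ! temporary to hold the off-processor data
      buffer(1:n) = a(1:n)[p]          ! one-sided GET fills the buffer
      x(1:n) = buffer(1:n) + y(1:n)    ! purely local computation
      deallocate(buffer)
      call sync_all()
    end program get_with_temporary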

Detailed Results Itanium2+Quadrics (Elan4): similar for 50^3, 9% better for 150^3 and 300^3. Alpha+Quadrics (Elan3): 8% better for 50^3, 16% lower for 150^3 and similar for 300^3; ARMCI lacks non-blocking notifies on Elan3. SGI Altix 3000: comparable for 50^3 and 150^3, 10% better for 300^3. Itanium2+Myrinet: similar for 50^3, 12% better for 150^3 and 9% better for 300^3

SGI Altix 3000, communication throughput microbenchmark Warm cache Cold cache

One- vs. Two-buffer Communication (timeline diagram, source and destination) One-buffer communication: delays on the source. Two-buffer communication: ideally, we want the delay on the source image to be zero; the notify arrives before the source calls sync_wait, giving smaller delays!

Asynchrony-tolerant CAF Implementation (animation: the sync_notify / sync_wait / PUT / sync_notify handshake, first with one communication buffer, then with two communication buffers)