Experiences with Sweep3D Implementations in Co-array Fortran


1 Experiences with Sweep3D Implementations in Co-array Fortran
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey. Department of Computer Science, Rice University, Houston, TX, USA. Good afternoon everyone. My name is Cristian Coarfa and today I’m going to talk about our experiences with Sweep3D implementations in Co-array Fortran. This is joint work with Yuri Dotsenko and John Mellor-Crummey.

2 Parallel Programming Models
Motivation:
MPI: de facto standard, but difficult to program
OpenMP: inefficient to map onto distributed-memory platforms; lacks locality control
HPF: hard to obtain high performance; heroic compilers needed!
An appealing middle ground: global address space languages: CAF, Titanium, UPC
The increasing size of current parallel systems requires programming models that enhance developers’ productivity without compromising performance; we would like such models to work well with a broad range of applications and systems. High-level programming models already exist, but they have several drawbacks. With HPF it is not easy to obtain high performance, requiring heroic compiler effort. OpenMP does not lend itself to efficient implementations on cluster platforms. This led to MPI becoming the de facto standard for parallel programming; MPI offers portability, but the developer is solely responsible for choreographing the computation and communication to achieve high performance; MPI is also difficult to program and not amenable to compiler-based optimizations. An appealing middle ground is represented by the family of global address space languages such as Co-Array Fortran, Unified Parallel C, and Titanium. In this talk we focus on Co-Array Fortran and evaluate it for an application with a sophisticated parallelization: Sweep3D.

3 Co-Array Fortran
Global address space programming model
one-sided communication (GET/PUT)
Programmer has control over performance-critical factors: data distribution, computation partitioning, communication placement
Data movement and synchronization as language primitives: amenable to compiler-based communication optimization
Co-Array Fortran, abbreviated CAF, is a global address space programming model that uses one-sided communication through PUT and GET operations. The developer controls performance-critical factors such as data distribution, computation partitioning, and communication placement. Data movement and synchronization are expressed at the language level as primitives, making CAF amenable to compiler optimization of communication.

4 CAF Programming Model Features
SPMD process images: fixed number of images during execution; images operate asynchronously
Both private and shared data:
real x(20, 20) : a private 20x20 array in each image
real y(20, 20)[*] : a shared 20x20 array in each image
Simple one-sided shared-memory communication:
x(:,j:j+2) = y(:,p:p+2)[r] : copy columns from image r into local columns
Synchronization intrinsic functions:
sync_all : a barrier and a memory fence
sync_mem : a memory fence
sync_team([team members to notify], [team members to wait for])
Pointers and (perhaps asymmetric) dynamic allocation
CAF is an SPMD programming model. It enables the programmer to specify both private and shared data; the bracket notation is used to declare a co-array. To access remote data, the bracket notation is used again to specify a remote image number. CAF offers language-level synchronization such as barriers, memory fences, and team synchronization. Next I will give a visual presentation of CAF communication.
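To make the model concrete, here is a minimal sketch (not taken from the slides) that combines a private array, a co-array, a one-sided GET, and barrier synchronization; it assumes the original CAF intrinsics this_image() and sync_all() as implemented by compilers such as cafc:

  program caf_features_sketch
    real :: x(20, 20)        ! private: an independent copy on every image
    real :: y(20, 20)[*]     ! co-array: a remotely accessible copy on every image
    integer :: me

    me = this_image()
    y  = real(me)            ! each image initializes its own part of the co-array

    call sync_all()          ! barrier + memory fence: all initializations are visible

    if (me > 1) then
       x(:, :) = y(:, :)[1]  ! one-sided GET: read image 1's copy of y into private x
    end if

    call sync_all()
  end program caf_features_sketch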

5 One-sided Communication with Co-Arrays
integer a(10,20)[*]
if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[Diagram: the co-array a(10,20) on image 1, image 2, ..., image N; each image copies two columns from its left neighbor.]
I would like to present a visual representation of co-arrays and co-array communication. The bracket at the end of the declaration of a means that a is a co-array; each image has a shared 10x20 array, and the collection of these shared arrays is the co-array a. The communication model is one-sided, and both the source and the destination are explicit. In this example every image except the first copies data from its left neighbor; this_image() is a CAF intrinsic that returns the index of the current process image.

6 Outline
CAF programming model
cafc
Sweep3D implementations in CAF
Experimental evaluation
Conclusions
Next I will talk about the Co-array Fortran compiler developed at Rice University.

7 Rice Co-Array Fortran Compiler (cafc)
First CAF multi-platform compiler: the previous compiler ran only on Cray shared-memory systems
Implements the core of the language; currently lacks support for derived-type and dynamic co-arrays
Core sufficient for non-trivial codes
Performance comparable to that of hand-tuned MPI codes
Open source
CAF was previously implemented only on Cray shared-memory systems. For a parallel programming model to be attractive, it needs a portable implementation. At Rice University we developed the first multi-platform CAF compiler, called cafc. cafc implements the core of CAF, enough to support non-trivial codes. CAF codes compiled with cafc have achieved performance similar to that of hand-tuned MPI codes. Our compiler is open source and available for download on the web.

8 cafc Implementation Strategy
Goals: portability; high performance on a wide range of platforms
Source-to-source compilation of CAF codes: uses the Open64/SL Fortran 90 infrastructure
CAF → Fortran 90 + communication operations
Communication: ARMCI library for one-sided communication on clusters (PNNL); load/store communication on shared-memory platforms
For portability, cafc performs source-to-source compilation of CAF codes. We use the Open64/SL Fortran 90 infrastructure; CAF codes are translated into Fortran 90 plus communication operations. For communication we use the portable one-sided communication library ARMCI, developed at PNNL, on clusters, and load/store communication on shared-memory platforms.

9 Synchronization Original CAF specification: team synchronization only
sync_all, sync_team
Limits performance on loosely-coupled architectures
Point-to-point extensions: sync_notify(q), sync_wait(p)
Point-to-point synchronization semantics: delivery of a notify to q from p => all communication from p to q issued before the notify has been delivered to q
The original CAF synchronization model contains only team synchronization primitives, such as sync_all and sync_team. These can limit the performance of CAF codes on loosely-coupled architectures. We proposed to extend the CAF communication model with point-to-point primitives, namely sync_notify and sync_wait. The semantics of the point-to-point synchronization primitives is that when a notify from p is delivered to q, all communication from p to q issued before the notify has also been delivered to q.
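As an illustration of the proposed primitives, the sketch below (not from the slides; the buffer name and size are placeholders) pairs a PUT with sync_notify on the source image and a matching sync_wait on the destination image. In general, as discussed later in the talk, the source would also wait for a "buffer free" notify from the destination before reusing a buffer.

  program notify_wait_sketch
    real :: local(1000)              ! private data produced by this image
    real :: inbox(1000)[*]           ! co-array where the predecessor deposits data
    integer :: me, next, prev

    me    = this_image()
    next  = me + 1
    prev  = me - 1
    local = real(me)

    if (me < num_images()) then
      inbox(:)[next] = local(:)      ! one-sided PUT into the successor's inbox
      call sync_notify(next)         ! the notify is delivered no earlier than the PUT
    end if

    if (me > 1) then
      call sync_wait(prev)           ! wait for the predecessor's notify
      ! ... inbox now safely holds the predecessor's data ...
    end if
  end program notify_wait_sketch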

10 CAF Compiler Targets (Oct 2004)
Processors: Pentium, Alpha, Itanium2, MIPS
Interconnects: Quadrics, Myrinet, Gigabit Ethernet, shared memory
Operating systems: Linux, Tru64, IRIX
As of October 2004, cafc runs on and generates code for a wide range of architectures.

11 Outline
CAF programming model
cafc
Sweep3D implementations
Original MPI implementation
CAF versions
Communication microbenchmark
Experimental evaluation
Conclusions
Next I will present several CAF implementations of Sweep3D.

12 Sweep3D
Core of an ASCI application
Solves a one-group, time-independent, discrete ordinates (Sn), 3D Cartesian (XYZ) geometry neutron transport problem
Deterministic particle transport accounts for 50-80% of the execution time of many realistic DOE simulations
Sweep3D represents the core of an ASCI application. It solves a one-group, time-independent, discrete ordinates, 3D Cartesian geometry neutron transport problem. It is important to implement Sweep3D efficiently because deterministic particle transport accounts for 50 to 80 percent of the execution time of many realistic DOE simulations. Our starting point was the original MPI version of Sweep3D.

13 Sweep3D Parallelization
The parallel version of Sweep3D uses a 2D spatial domain decomposition onto a 2D processor array and employs wavefront parallelism. I will show a visual representation of the computation/communication pattern of Sweep3D.
2D spatial domain decomposition onto a 2D processor array

14 Sweep3D Parallelization
The first processor computes on its data.
Wavefront parallelism

15 Sweep3D Parallelization
Then it sends information to its neighbors to the west and south.
Wavefront parallelism

16 Sweep3D Parallelization
The first processor then goes on to the second iteration of its computation, while its west and south neighbors perform their first iteration.
Wavefront parallelism

17 Sweep3D Parallelization
In the next step all three processors send data to their west and south neighbors.
Wavefront parallelism

18 Sweep3D Parallelization
The wavefront advances; all processors on or before the wavefront are active.
Wavefront parallelism

19 Sweep3D Parallelization
Again everybody communicates.
Wavefront parallelism

20 Sweep3D Parallelization
Let’s assume that the processors only perform three iterations, so the top-left processor is done.
Wavefront parallelism

21 Sweep3D Parallelization
The process continues until all the processors have finished the computation.
Wavefront parallelism

22 Sweep3D Parallelization
Wavefront parallelism

23 Sweep3D Parallelization
Wavefront parallelism

24 Sweep3D Parallelization
Wavefront parallelism

25 Sweep3D Kernel Pseudocode
do iq = 1, 8
  do mo = 1, mmo
    do kk = 1, kb
      recv e/w into Phiib
      recv n/s into Phijb
      ... ! heavy computation with use/update
          ! of Phiib and Phijb
      send e/w Phiib
      send n/s Phijb
    enddo
  enddo
enddo
Next I will show the pseudocode of the Sweep3D kernel. The outer loops control the angle and the granularity of the pipeline.

26 Sweep3D Kernel Pseudocode
(Same kernel pseudocode as above.) Processors receive data from their e/w and n/s neighbors, if necessary.

27 Sweep3D Kernel Pseudocode
(Same kernel pseudocode as above.) They perform the heavy computation that uses and updates Phiib and Phijb.

28 Sweep3D Kernel Pseudocode
(Same kernel pseudocode as above.) They then send data to their successors, if necessary.

29 Initial Sweep3D CAF Implementation
Based on the MPI implementation
Maintain the original computation
Convert communication buffers into co-arrays
Fundamental issue: converting from two-sided communication into one-sided communication
Next I am going to talk about our first CAF implementation. We derived it from the MPI implementation available on the web. A crucial issue was converting from two-sided to one-sided communication.

30 2-sided vs 1-sided Communication
Let’s examine two-sided and one-sided communication in more detail. In the diagram, the thread on the left is the sender and the thread on the right is the receiver.
2-sided comm

31 2-sided vs 1-sided Communication
MPI_Send MPI_Recv
In the MPI version, the sender calls MPI_Send and the receiver calls MPI_Recv.
2-sided comm

32 2-sided vs 1-sided Communication
MPI_Send MPI_Recv
There are two important points to note about the MPI communication: there is an implicit synchronization between sender and receiver, and the MPI library manages communication buffers automatically.
2-sided comm

33 2-sided vs 1-sided Communication
MPI_Send MPI_Recv
In CAF, communication buffer management and synchronization are exposed at the language level. In the one-sided diagram, the thread on the left is the source and the thread on the right is the destination.
2-sided comm 1-sided comm

34 2-sided vs 1-sided Communication
sync_notify sync_wait MPI_Send MPI_Recv
In the general case, the source first needs to receive a notification that the destination buffer can be written into, to avoid data races.
2-sided comm 1-sided comm

35 2-sided vs 1-sided Communication
sync_notify sync_wait PUT MPI_Send MPI_Recv
Then it performs a PUT.
2-sided comm 1-sided comm

36 2-sided vs 1-sided Communication
sync_notify sync_wait PUT MPI_Send MPI_Recv sync_notify
The PUT is followed by a notify to the destination.
2-sided comm 1-sided comm

37 2-sided vs 1-sided Communication
sync_notify sync_wait PUT MPI_Send MPI_Recv sync_notify sync_wait
The destination consumes the notify with a sync_wait call, at which point it knows that the communication has completed, and both source and destination can advance their computation.
2-sided comm 1-sided comm
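Putting the handshake above into code, one MPI_Send/MPI_Recv pair maps onto a PUT bracketed by point-to-point synchronization. The following is a sketch only: the buffer name, its size, and the fixed source/destination pairing are placeholders rather than the actual Sweep3D code, and it assumes the cafc sync_notify/sync_wait extensions.

  program handshake_sketch
    integer, parameter :: n = 100
    real :: phi(n)[*]            ! communication buffer, declared as a co-array
    integer :: me, src, dest

    me  = this_image()
    phi = real(me)

    ! pair up images 1 (source) and 2 (destination); other images idle
    src = 1; dest = 2
    if (num_images() >= 2) then
      if (me == src) then
        call sync_wait(dest)       ! destination's "buffer free" notify avoids data races
        phi(1:n)[dest] = phi(1:n)  ! one-sided PUT replaces the MPI_Send
        call sync_notify(dest)     ! tell the destination the data has been issued
      else if (me == dest) then
        call sync_notify(src)      ! "my buffer may be overwritten"
        call sync_wait(src)        ! replaces the MPI_Recv: the PUT is delivered by now
        ! ... compute with phi(1:n), which now holds the source's values ...
      end if
    end if
  end program handshake_sketch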

38 CAF Implementation Issues
Synchronization necessary to avoid data races might lead to inefficiency
Using multiple communication buffers enables overlap of synchronization with computation

39 One- vs. Two-buffer Communication
One-buffer communication: pipeline bubbles
Two-buffer communication: virtually no bubbles! The notify arrives before the source calls sync_wait.
[Diagram: source and destination timelines for one-buffer vs. two-buffer communication.]

40 Asynchrony-tolerant CAF Implementation of Sweep3D
Multiple-versioned communication buffers
Benefits: overlap of the PUT with computation on the destination; overlap of synchronization with computation on the source
There are several ways to make CAF communication more efficient: for example, using less synchronization (by exploiting transitive properties of point-to-point synchronization) or using more buffer space (up to one buffer per communication event). We designed an asynchrony-tolerant implementation of Sweep3D using multiple-versioned communication buffers, which enable the overlap of a PUT by the source process image with computation on the destination process image. On a cluster with support for non-blocking communication and synchronization, one can use three buffer versions: in the stationary state, one version is written into asynchronously by the predecessor, one version is used for computation by the current process image, and the last version is used for a non-blocking PUT to the successor. In practice, we implemented the multi-versioned buffers as an array of buffers. Next I will show the benefits of having multi-version buffers.
Stationary state for a three-versioned communication buffer:
one buffer version written into asynchronously by the predecessor
one buffer version computed on by the current process image
one buffer version used for a non-blocking PUT to the successor

41 Three-buffer Communication
[Diagram: three buffer versions rotating roles: one receiving from the predecessor, one being computed on, one being sent to the successor.]
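A minimal sketch of how multi-versioned buffers can be organized as an array of buffer versions with round-robin role rotation (illustrative only: the names, the number of versions, and the extents are placeholders, and the waits, PUTs, and notifies of the real sweep are elided):

  program buffer_rotation_sketch
    integer, parameter :: NVER = 3, NMAX = 1000
    real :: phi_buf(NMAX, NVER)[*]   ! multi-versioned communication buffer (co-array)
    integer :: v_in, v_comp, v_out, step, nsteps

    nsteps = 10
    v_in = 1; v_comp = 2; v_out = 3

    do step = 1, nsteps
       ! version v_in  : the predecessor's PUT for a later step may arrive here asynchronously
       ! version v_comp: used for this step's computation
       ! version v_out : holds data being PUT (ideally non-blocking) to the successor
       !
       ! (the waits, computation, PUTs, and notifies of the actual sweep are elided)

       ! rotate the roles round-robin for the next step
       v_in   = mod(v_in,   NVER) + 1
       v_comp = mod(v_comp, NVER) + 1
       v_out  = mod(v_out,  NVER) + 1
    end do
  end program buffer_rotation_sketch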

42 Communication Throughput Microbenchmark
MPI implementation: blocking send and receive
CAF implementation: one-version buffer
CAF implementation: multi-versioned buffers
ARMCI implementation: one buffer

43 Outline
CAF programming model
cafc
Sweep3D implementations
Experimental evaluation
Conclusions

44 Experimental Evaluation
Platforms:
Itanium2 + Quadrics QSNet II (Elan4)
SGI Altix 3000
Itanium2 + Myrinet 2000
Alpha + Quadrics QSNet (Elan3)
Problem sizes: 50x50x50, 150x150x150, 300x300x300

45 Itanium2 + Quadrics, Size 50x50x50

46 Itanium2 + Quadrics, Size 150x150x150

47 Itanium2 + Quadrics, Size 300x300x300
Multi-version buffers improve the performance of CAF codes by 15%. It is imperative to use non-blocking notifies.

48 Itanium2+Quadrics, Communication Throughput Microbenchmark
Multi-version buffers improve throughput by 30% for messages up to 8KB and by 10% for messages larger than 8KB. The overhead of the CAF translation is acceptable.

49 SGI Altix 3000, Size 50x50x50

50 SGI Altix 3000, Size 150x150x150
Multi-version buffers are effective for asynchrony tolerance.

51 SGI Altix 3000, Size 300x300x300
Both CAF implementations outperform MPI.

52 SGI Altix 3000, Communication Throughput Microbenchmark
Warm cache
The ARMCI library effectively exploits the hardware support for efficient data movement, calling a system-tuned memcpy; MPI performs extra data copies.

53 Summary of results MPI buffering for small messages helps latency & asynchrony tolerance CAF multi-version buffers improve performance of one-sided communication for wavefront computations enables PUT and receiver’s computation to overlap asynchrony tolerance between sender and receiver Non-blocking notifies are important for performance enables synchronization to overlap with computation Platform results CAF outperforms MPI for large problem sizes by ~10% on Itanium2+{Quadrics,Myrinet,Altix} CAF ~16%slower on Alpha+Quadrics(Elan3) ARMCI lacks non-blocking notifies on Elan3

54 Enhancing CAF Usability
CAF vs. MPI usability: easier to use than MPI for simple parallel programs; as difficult for carefully tuned parallel codes
Improving CAF ease of use:
compiler support for managing multi-version communication buffers
vectorizing fine-grain communication to best support the X1 and cluster platforms with the same code

55

56 Implementing Communication
x(1:n) = a(1:n)[p] + …
Use a temporary buffer to hold off-processor data:
allocate buffer
perform GET to fill buffer
perform computation: x(1:n) = buffer(1:n) + …
deallocate buffer
Optimizations:
no temporary storage for co-array to co-array copies
load/store communication on shared-memory systems
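As a rough illustration of this translation (a sketch only: caf_runtime_get is a hypothetical stand-in for the runtime's one-sided GET, e.g. ARMCI on clusters, and the local array y is a placeholder completing the elided right-hand side), cafc-style generated Fortran 90 might look like:

  subroutine translated_stmt_sketch(x, y, n, p)
    ! sketch of the translation of: x(1:n) = a(1:n)[p] + ...
    integer, intent(in) :: n, p
    real, intent(inout) :: x(n)
    real, intent(in)    :: y(n)
    real, allocatable   :: buffer(:)

    allocate(buffer(n))                 ! temporary buffer for the off-processor data
    call caf_runtime_get(buffer, n, p)  ! hypothetical runtime call: fetch a(1:n) from image p
    x(1:n) = buffer(1:n) + y(1:n)       ! the computation now uses the local copy
    deallocate(buffer)
  end subroutine translated_stmt_sketch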

57 Detailed Results
Itanium2 + Quadrics (Elan4): similar for 50^3, 9% better for 150^3 and 300^3
Alpha + Quadrics (Elan3): 8% better for 50^3, 16% lower for 150^3, similar for 300^3; ARMCI lacks non-blocking notifies on Elan3
SGI Altix 3000: comparable for 50^3 and 150^3, 10% better for 300^3
Itanium2 + Myrinet: similar for 50^3, 12% better for 150^3, and 9% better for 300^3

58 SGI Altix 3000, communication throughput microbenchmark
Warm cache Cold cache

59 One- vs. Two-buffer Communication
One-buffer communication: delays
Two-buffer communication: smaller delays! Ideally, we want the delay on the source image to be zero: the notify arrives before the source calls sync_wait.
[Diagram: source and destination timelines for one-buffer vs. two-buffer communication.]

60 Asynchrony-tolerant CAF Implementation
[Diagram: notify/wait/PUT timeline with one communication buffer.]

61 Asynchrony-tolerant CAF Implementation
[Diagram: notify/wait/PUT timelines with one communication buffer vs. two communication buffers.]


64

