1 A Multi-platform Co-Array Fortran Compiler Yuri Dotsenko Cristian Coarfa John Mellor-Crummey Department of Computer Science Rice University Houston, TX USA

2 Motivation: Parallel Programming Models
MPI: de facto standard
– difficult to program
OpenMP: inefficient to map on distributed memory platforms
– lack of locality control
HPF: hard to obtain high performance
– heroic compilers needed!
Global address space languages (CAF, Titanium, UPC): an appealing middle ground

3 Co-Array Fortran
Global address space programming model
– one-sided communication (GET/PUT)
Programmer has control over performance-critical factors
– data distribution
– computation partitioning
– communication placement
Data movement and synchronization as language primitives
– amenable to compiler-based communication optimization

4 CAF Programming Model Features
SPMD process images
– fixed number of images during execution
– images operate asynchronously
Both private and shared data
– real x(20,20): a private 20x20 array in each image
– real y(20,20)[*]: a shared 20x20 array in each image
Simple one-sided shared-memory communication
– x(:,j:j+2) = y(:,p:p+2)[r]: copy columns from image r into local columns
Synchronization intrinsic functions
– sync_all: a barrier and a memory fence
– sync_mem: a memory fence
– sync_team([team members to notify], [team members to wait for])
Pointers and (perhaps asymmetric) dynamic allocation
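The following is a minimal sketch (not from the original slides) that combines the features above: a private array, a co-array, a one-sided copy, and barrier synchronization. The array shapes and the neighbor choice are illustrative only.

  program caf_sketch
    implicit none
    real :: x(20,20)        ! private: an independent copy in each image
    real :: y(20,20)[*]     ! co-array: y on any image is remotely addressable
    integer :: me, right

    me = this_image()
    right = me + 1
    y = real(me)            ! each image fills its own piece of the co-array
    call sync_all()         ! barrier + memory fence: remote data is now ready

    if (right <= num_images()) then
      x(:,1:2) = y(:,19:20)[right]   ! one-sided GET of two columns from image right
    end if
    call sync_all()
  end program caf_sketch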

5 One-sided Communication with Co-Arrays
  integer a(10,20)[*]
  if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
(figure: the a(10,20) co-array shown on image 1, image 2, ..., image N)

6 Rice Co-Array Fortran Compiler (cafc)
First multi-platform CAF compiler
– previous compiler only for Cray shared-memory systems
Implements the core of the language
– currently lacks support for derived-type and dynamic co-arrays
Core is sufficient for non-trivial codes
Performance comparable to that of hand-tuned MPI codes
Open source

7 Outline
CAF programming model
cafc
  → Core language implementation
  – Optimizations
Experimental evaluation
Conclusions

8 Implementation Strategy
Source-to-source compilation of CAF codes
– uses the Open64/SL Fortran 90 infrastructure
– CAF → Fortran 90 + communication operations
Communication
– ARMCI library for one-sided communication on clusters
– load/store communication on shared-memory platforms
Goals
– portability
– high performance on a wide range of platforms

9 Co-Array Descriptors
Initialize and manipulate Fortran 90 dope vectors. A declaration such as

  real :: a(10,10,10)[*]

is translated into a descriptor type and a local variable of that type:

  type CAFDesc_real_3
    integer(ptrkind) :: handle      ! opaque handle to the CAF runtime representation
    real, pointer    :: ptr(:,:,:)  ! Fortran 90 pointer to local co-array data
  end type CAFDesc_real_3
  type(CAFDesc_real_3) :: a

10 Allocating COMMON and SAVE Co-Arrays
Compiler
– generates a static initializer for each COMMON/SAVE variable
Linker
– collects calls to all initializers
– generates a global initializer that calls all others
– compiles the global initializer and links it into the program
Launch
– invokes the global initializer before the main program begins
    allocates co-array storage outside the Fortran 90 runtime system
    associates co-array descriptors with the allocated memory
Similar to the handling of C++ static constructors
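A sketch of the per-variable initializer the compiler might emit for a SAVE co-array such as real, save :: c(100)[*]. The module name cafc_runtime, the routine cafc_allocate, and its argument list are hypothetical, invented here only to illustrate the scheme; the descriptor type is the one from slide 9, specialized to one dimension.

  subroutine cafc_init_c()
    use cafc_runtime, only: cafc_allocate        ! hypothetical runtime interface
    type(CAFDesc_real_1), save :: c_desc         ! descriptor for c(100)[*]
    ! reserve co-array storage outside the Fortran 90 runtime system and
    ! associate the descriptor's Fortran 90 pointer with that memory
    call cafc_allocate(c_desc%handle, c_desc%ptr, lbounds=[1], ubounds=[100])
  end subroutine cafc_init_c

The linker-generated global initializer would then simply call cafc_init_c and its peers once, before the main program starts, in the spirit of C++ static constructors.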

11 Parameter Passing
Call-by-value convention (copy-in, copy-out)
– pass remote co-array data to procedures only as values
    call f(( a(i)[p] ))          ! extra parentheses force evaluation to a value
Call-by-co-array convention*
– argument declared as a co-array by the callee
– enables access to local and remote co-array data
    real :: x(10)[*]
    call f(x)
    subroutine f(a)
      real :: a(10)[*]
Call-by-reference convention* (cafc)
– argument declared as an explicit-shape array
– enables access to local co-array data only
– enables reuse of existing Fortran code
    subroutine f(a)
      real :: a(10)
* requires an explicit interface

12 Multiple Co-dimensions
Managing processors as a logical multi-dimensional grid
    integer a(10,10)[5,4,*]      ! 3D processor grid 5 x 4 x ...
Support co-space reshaping at procedure calls
– change the number of co-dimensions
– co-space bounds as procedure arguments
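A hedged sketch of co-space reshaping at a procedure call: the caller sees a one-dimensional co-space while the callee reshapes it into a multi-dimensional processor grid whose bounds arrive as ordinary arguments. The routine name sweep and the specific bounds are illustrative, not taken from the paper.

  subroutine caller()
    real, save :: a(10,10)[*]    ! 1-D co-space in the caller
    call sweep(a, 5, 4)          ! pass the desired processor-grid bounds
    ! (an explicit interface for sweep is required; omitted here for brevity)
  end subroutine caller

  subroutine sweep(a, px, py)
    integer :: px, py
    real :: a(10,10)[px,py,*]    ! reshaped: a 3-D co-space of px x py x ... images
    ! a(1,1)[2,3,1] now addresses the image at logical grid position (2,3,1)
  end subroutine sweep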

13 Implementing Communication
x(1:n) = a(1:n)[p] + …
Use a temporary buffer to hold off-processor data
– allocate the buffer
– perform a GET to fill the buffer
– perform the computation: x(1:n) = buffer(1:n) + …
– deallocate the buffer
Optimizations
– no temporary storage for co-array-to-co-array copies
– load/store communication on shared-memory systems
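A sketch of what the generated buffer-based code might look like for a statement such as x(1:n) = a(1:n)[p] + y(1:n). The routine name cafc_get and its argument list are placeholders for the actual ARMCI-based runtime call, and a_desc stands for the descriptor of co-array a as on slide 9.

  ! source-level statement:   x(1:n) = a(1:n)[p] + y(1:n)
  real, allocatable :: tmp(:)
  allocate(tmp(n))                            ! 1. allocate a temporary buffer
  call cafc_get(a_desc%handle, p, 1, n, tmp)  ! 2. one-sided GET of a(1:n) from image p
                                              !    (placeholder runtime interface)
  x(1:n) = tmp(1:n) + y(1:n)                  ! 3. compute using only local data
  deallocate(tmp)                             ! 4. release the temporary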

14 Synchronization
Original CAF specification: team synchronization only
– sync_all, sync_team
– limits performance on loosely coupled architectures
Point-to-point extensions
– sync_notify(q)
– sync_wait(p)
Point-to-point synchronization semantics:
delivery of a notify to q from p ⇒ all communication from p to q issued before the notify has been delivered to q
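A hedged sketch of the point-to-point extensions replacing a barrier in a nearest-neighbor exchange; the image numbering and array shapes are illustrative only.

  integer :: me, left, right
  real :: halo(20)[*]            ! receive buffer, one per image
  real :: edge(20)               ! local boundary data to send

  me = this_image()
  left = me - 1
  right = me + 1

  if (right <= num_images()) then
    halo(:)[right] = edge(:)     ! one-sided PUT into the right neighbor
    call sync_notify(right)      ! by the semantics above, the notify is delivered
  end if                         ! only after the PUT has completed at image right

  if (left >= 1) then
    call sync_wait(left)         ! block until the left neighbor's notify arrives
    ! halo now safely holds the left neighbor's edge data
  end if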

15 Outline
CAF programming model
cafc
  – Core language implementation
  → Optimizations
      procedure splitting
      supporting hints for non-blocking communication
      packing strided communications
Experimental evaluation
Conclusions

16 An Impediment to Code Efficiency
Original reference:
    rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
Transformed reference:
    rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
The Fortran 90 pointer-based co-array representation does not convey
– the lack of co-array aliasing
– co-array contiguity
– co-array bounds
This lack of knowledge inhibits important code optimizations

17 Procedure Splitting
Original code:

  subroutine f(…)
    real, save :: c(100)[*]
    ... = c(50) ...
  end subroutine f

After CAF-to-CAF preprocessing:

  subroutine f(…)
    real, save :: c(100)[*]
    interface
      subroutine f_inner(…, c_arg)
        real :: c_arg[*]
      end subroutine f_inner
    end interface
    call f_inner(…, c)
  end subroutine f

  subroutine f_inner(…, c_arg)
    real :: c_arg(100)[*]
    ... = c_arg(50) ...
  end subroutine f_inner

18 Benefits of Procedure Splitting
Generated code conveys
– lack of co-array aliasing
– co-array contiguity
– co-array bounds
Enables the back-end compiler to generate better code

19 Hiding Communication Latency
Goal: enable communication/computation overlap
Impediments to generating non-blocking communication
– use of indexed subscripts in co-dimensions
– lack of whole-program analysis
Approach: support hints for non-blocking communication
– overcome conservative compiler analysis
– enable sophisticated programmers to achieve good performance today

20 Hints for Non-blocking PUTs
Hints for the CAF run-time system to issue non-blocking PUTs:

  region_id = open_nb_put_region()
  ...
  Put_Stmt_1
  ...
  Put_Stmt_N
  ...
  call close_nb_put_region(region_id)

Complete non-blocking PUTs:

  call complete_nb_put_region(region_id)

Open problem: exploiting non-blocking GETs?
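A usage sketch of these hint routines wrapped around concrete PUTs, with the communication overlapped by independent local work. The routine names come from this slide; the co-array, loop, and overlap region are illustrative.

  integer :: region_id, p
  real :: a(1000)[*], b(1000)

  region_id = open_nb_put_region()          ! PUTs in this region may be issued non-blocking
  do p = 1, num_images()
    if (p /= this_image()) a(1:1000)[p] = b(1:1000)
  end do
  call close_nb_put_region(region_id)

  ! ... independent local computation can overlap the PUTs still in flight ...

  call complete_nb_put_region(region_id)    ! force completion before the data is reused
  call sync_all()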

21 Strided vs. Contiguous Transfers
Problem: a CAF remote reference might induce many small data transfers
    a(i,1:n)[p] = b(j,1:n)
Solution: pack strided data on the source and unpack it on the destination
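A sketch of the packing idea written at the source level, as if done by hand (slide 22 argues the communication library is the more natural place for it). The contiguous staging co-array buf and the notify/wait handshake are illustrative additions.

  ! unpacked: the assignment a(i,1:n)[p] = b(j,1:n) moves n strided elements
  ! packed:   one contiguous transfer plus a local copy on each side
  real :: buf(n)[*]              ! contiguous staging co-array (illustrative)

  buf(1:n) = b(j,1:n)            ! pack: gather the strided row into contiguous storage
  buf(1:n)[p] = buf(1:n)         ! a single contiguous PUT to image p
  call sync_notify(p)

  ! ... on image p, after sync_wait on the sender:
  a(i,1:n) = buf(1:n)            ! unpack into the strided destination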

22 Pragmatics of Packing
Who should implement packing?
The CAF programmer
– difficult to program
The CAF compiler
– unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)
The communication library
– the most natural place
– ARMCI currently performs packing on Myrinet

23 CAF Compiler Targets (Sept 2004)
Processors
– Pentium, Alpha, Itanium2, MIPS
Interconnects
– Quadrics, Myrinet, Gigabit Ethernet, shared memory
Operating systems
– Linux, Tru64, IRIX

24 Outline
CAF programming model
cafc
  – Core language implementation
  – Optimizations
→ Experimental evaluation
Conclusions

25 Experimental Evaluation
Platforms
– Alpha + Quadrics QSNet (Elan3)
– Itanium2 + Quadrics QSNet II (Elan4)
– Itanium2 + Myrinet 2000
Codes
– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

26 NAS BT Efficiency (Class C)

27 NAS SP Efficiency (Class C). Chart annotation: the lack of a non-blocking notify implementation blocks CAF communication/computation overlap.

28 NAS MG Efficiency (Class C). Chart annotations: ARMCI communication is efficient; point-to-point synchronization boosts CAF performance by 30%.

29 NAS CG Efficiency (Class C)

30 NAS LU Efficiency (Class C)

31 Impact of Optimizations (Assorted Results)
Procedure splitting
– 42-60% improvement for BT on the Itanium2+Myrinet cluster
– 15-33% improvement for LU on the Alpha+Quadrics cluster
Non-blocking communication generation
– 5% improvement for BT on the Itanium2+Quadrics cluster
– 3% improvement for MG on all platforms
Packing of strided data
– 31% improvement for BT on the Alpha+Quadrics cluster
– 37% improvement for LU on the Itanium2+Quadrics cluster
See the paper for more details

32 Conclusions
CAF boosts programming productivity
– simplifies the development of SPMD parallel programs
– shifts the details of managing communication to the compiler
cafc delivers performance comparable to hand-tuned MPI
cafc implements effective optimizations
– procedure splitting
– non-blocking communication
– packing of strided communication (in ARMCI)
Vectorization is needed to achieve true performance portability on machines like the Cray X1