Co-array Fortran: Compilation, Performance, Language Issues. Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey. Department of Computer Science, Rice University.

2 Outline
- Co-array Fortran language overview
- Rice CAF compiler
- A preliminary performance study
- Ongoing research
- Conclusions

3 Co-array Fortran (CAF)
- Explicitly parallel extension of Fortran 90/95, defined by Numrich & Reid
- Global address space SPMD parallel programming model
  - one-sided communication
- Simple, two-level memory model for locality management
  - local vs. remote memory
- Programmer control over performance-critical decisions
  - data partitioning
  - communication
- Suitable for mapping to a range of parallel architectures
  - shared memory, message passing, hybrid, PIM

4 CAF Programming Model Features
- SPMD process images
  - fixed number of images during execution
  - images operate asynchronously
- Both private and shared data
  - real x(20,20)      ! a private 20x20 array in each image
  - real y(20,20)[*]   ! a shared 20x20 array in each image
- Simple one-sided shared-memory communication
  - x(:,j:j+2) = y(:,p:p+2)[r]   ! copy columns p:p+2 of y on image r into local columns j:j+2 of x

5 One-sided Communication with Co-Arrays
integer a(10,20)[*]
if (this_image() > 1) &
  a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
(figure: copies of a(10,20) on images 1, 2, …, N)

6 CAF Programming Model Features
- Synchronization intrinsic functions
  - sync_all – a barrier and a memory fence
  - sync_mem – a memory fence
  - sync_team([notify], [wait])
    - notify = a vector of process ids to signal
    - wait = a vector of process ids to wait for (a subset of notify)
- Pointers and (perhaps asymmetric) dynamic allocation
- Parallel I/O
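The point-to-point flavor of sync_team can be sketched as a producer/consumer pair. This is a hedged sketch following the argument description above; the exact calling syntax in the Rice compiler may differ.

```fortran
! Image 1 produces data; image 2 waits for the notify before reading it.
integer :: buf(100)[*]

if (this_image() == 1) then
  buf(:) = 42                       ! produce data locally
  call sync_team( (/ 2 /) )        ! notify image 2 that buf is ready
else if (this_image() == 2) then
  call sync_team( wait = (/ 1 /) ) ! block until image 1's notify arrives
  buf(:) = buf(:)[1]               ! now safe to read image 1's copy
end if
```

Unlike sync_all, only the two images involved pay a synchronization cost.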

7 CAF Language Assessment
Strengths
- data movement and synchronization as language primitives
  - amenable to compiler optimization
- offloads communication management to the compiler
  - choreographing data transfer
  - managing mechanics of synchronization
- gives user full control of parallelization
- array syntax supports natural user-level vectorization
  - modest compiler technology can yield good performance
- more abstract than MPI -> better performance portability
Weaknesses
- user manages partitioning of work
- user specifies data movement
- user codes necessary synchronization

8 Outline
- Co-array Fortran language overview
- Rice CAF compiler
- A preliminary performance study
- Ongoing research
- Conclusions

9 Compiler Goals
- Portable, open-source compiler
- Multi-platform code generation
- High-performance generated code

10 Compilation Approach
- Source-to-source translation: translate CAF into Fortran 90 + communication calls
- Benefits
  - wide portability
  - leverage vendor F90 compilers for good node performance
- One-sided communication layer
  - strided communication
  - gather/scatter
  - synchronization: barriers, notify/wait
  - split-phase non-blocking primitives
- Today: ARMCI, the remote memory copy interface (PNNL)

11 Co-array Data
- Co-array representation
  - F90 pointer to data + opaque handle for communication layer
- Co-array access
  - read/write local co-array data using F90 pointer dereference
  - remote accesses translate into GET/PUT calls
- Co-array allocation
  - storage allocated by communication layer, as appropriate
    - on shared-memory hardware: in a shared-memory segment
    - on clusters: in pinned memory for DMA access
  - dope vector initialization using CHASM (LANL)
    - set F90 pointer to point to externally managed memory
  - collect co-array initializers at link time into a global initializer
  - call the global initializer at program launch
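The representation above can be pictured as a small descriptor per co-array. This is an illustrative sketch only; the type and field names are hypothetical, not the actual Rice CAF runtime types.

```fortran
! Hypothetical descriptor pairing the local F90 view of a co-array
! with the opaque handle the communication layer uses for GET/PUT.
type caf_coarray
  real, pointer :: local(:,:)   ! F90 pointer to this image's data
  integer(8)    :: comm_handle  ! opaque handle for the communication layer
end type caf_coarray
```

Local accesses dereference `local` directly; remote accesses hand `comm_handle` to the communication library.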

12 Implementing Communication
Given a statement: X(1:n) = A(1:n)[p] + …
A temporary buffer is used for off-processor data
- invoke communication library to allocate tmp in suitable temporary storage
- fill in a dope vector so tmp can be accessed as an F90 pointer
- call communication library to fill in tmp (GET)
- X(1:n) = tmp(1:n) + …
- deallocate tmp
Optimizations
- co-array to co-array communication: no temporary storage
- on shared-memory systems: direct load/store
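The translation steps above can be sketched as generated code. The routine names (caf_allocate_temp, caf_get, caf_free_temp) are illustrative stand-ins, not the actual runtime interface.

```fortran
! Hypothetical translation of  X(1:n) = A(1:n)[p] + 1.0
real, pointer :: tmp(:)

call caf_allocate_temp(tmp, n)         ! runtime picks DMA-friendly storage
call caf_get(tmp, a_handle, p, 1, n)   ! one-sided GET of A(1:n) from image p
x(1:n) = tmp(1:n) + 1.0                ! local computation on the fetched copy
call caf_free_temp(tmp)
```

When the left-hand side is itself a co-array, the temporary (and its extra copy) can be elided, which is the first optimization the slide lists.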

13 Porting to a New Compiler / Architecture
- Synthesize dope vectors for co-array storage
  - compiler/architecture-specific details: CHASM library
- Tailor communication to the architecture
  - design supports alternate communication libraries
  - status today: ARMCI (PNNL)
  - ongoing work: compiler-tailored communication
    - direct load/store on shared-memory architectures
  - future
    - other portable libraries (e.g. GASNet)
    - custom communication library for an architecture

14 CAF Compiler Status
- Near production-quality F90 front end from Open64
  - being enhanced to meet needs of this project and others
- Working prototype for CAF core features
- Co-array communication inserted around statements with co-array accesses
  - currently no optimization

15 Supported Features
Declarations
- co-objects: scalars and arrays
- COMMON and SAVE co-objects of primitive types
  - COMMON blocks: variables and co-objects intermixed
- co-objects with multiple co-dimensions
- procedure interface blocks with co-array arguments
Executable code
- array section notation for co-array data indices, local and remote
- co-array argument passing
  - co-array dummy arguments require an explicit interface
  - passed as co-array pointer + communication handle
  - co-array reshaping supported
CAF intrinsics
- image inquiry: this_image(…), num_images()
- synchronization: sync_all, sync_team, sync_notify, sync_wait
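The explicit-interface requirement for co-array dummy arguments can be sketched as follows; the subroutine name and shapes are hypothetical examples, not code from the compiler's test suite.

```fortran
! The caller needs an explicit interface so the compiler knows to pass
! the co-array as a pointer plus communication handle.
interface
  subroutine halo_exchange(u, n)
    integer :: n
    real    :: u(n,n)[*]   ! co-array dummy argument
  end subroutine halo_exchange
end interface
```

Without the interface, the caller would pass only a bare address and the callee could not perform remote accesses on the argument.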

16 Coming Attractions
- Allocatable co-arrays
    REAL(8), ALLOCATABLE :: X(:)[*]
    ALLOCATE(X(MYX_NUM)[*])
- Co-arrays of user-defined types
- Allocatable co-array components
  - user-defined type with pointer components
- Triplets in co-dimensions
    A(j,k)[p+1:p+4]

17 CAF Compiler Targets (May 2004)
- SGI Altix 3000 / Itanium2, Linux64 RedHat 7.2
- SGI Origin 2000 / MIPS, IRIX
- Pentium + Ethernet workstations, Linux32 RedHat 7.1
- Itanium2 + Myrinet, Linux64 RedHat 7.1
- Itanium2 + Quadrics, Linux64 RedHat 7.1
- AlphaServer SC + Quadrics, OSF1 Tru64 V5.1A

18 Outline
- Co-array Fortran language overview
- Rice CAF compiler
- A preliminary performance study
- Ongoing research
- Conclusions

19 A Preliminary Performance Study
Platforms
- SGI Altix 3000 (Itanium2 1.5 GHz)
- Itanium2 + Quadrics QsNet II (Elan4, Itanium2 1.5 GHz)
- SGI Origin 2000 (MIPS R12000)
Codes
- NAS Parallel Benchmarks (NPB) from NASA Ames
- CAF STREAM benchmark

20 NAS Parallel Benchmarks (NPB) 2.3
- Benchmarks by NASA Ames, 2–3K lines each (Fortran 77)
- Widely used to test parallel compiler performance
- BT, SP, MG and CG
- NAS versions
  - NPB2.3b2: hand-coded MPI
  - NPB2.3-serial: serial code extracted from the MPI version
- Our version
  - NPB2.3-CAF: CAF implementation, based on the MPI version

21 NAS BT Efficiency (Class C)
Lesson
- tradeoff: number of buffers vs. synchronization
  - more buffers = less synchronization
  - less synchronization = improved performance

22 NAS SP Efficiency (Class C)
Lesson
- inability to overlap communication with computation in procedure calls hurts performance

23 NAS MG Efficiency (Class C)
Lessons
- replacing barriers with point-to-point synchronization can boost performance by 30%
- converting GETs into PUTs also improved performance

24 NAS CG Efficiency (Class C)
Lessons
- aggregation and vectorization are critical for high-performance communication
- memory layout of buffers and arrays may require thorough analysis and optimization

25 CAF STREAM Benchmark
- Derived from STREAM, a synthetic benchmark that measures sustainable memory bandwidth and computation rate for simple kernels
- CAF STREAM
  - Copy, Scale, Add, Triad
  - local and remote-get versions
  - fine- and coarse-grain access
- Aim: determine the most efficient representation for co-arrays on particular platforms
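The local and remote-get variants can be sketched with the Triad kernel. This is an assumed formulation based on the standard STREAM Triad, not the benchmark's actual source; `p` is a hypothetical neighbor image.

```fortran
! STREAM-style Triad: a = b + s*c, local and remote-get variants
integer, parameter :: n = 1000
real(8) :: a(n)[*], b(n)[*], c(n)[*]
real(8) :: s
integer :: p

a(:) = b(:) + s * c(:)         ! local triad: all operands on this image
a(:) = b(:)[p] + s * c(:)[p]   ! remote-get triad: operands fetched from image p
```

Comparing the two variants (and fine-grain element-wise versions of each) exposes how efficiently a given co-array representation supports local dereference and bulk remote access.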

26 CAF STREAM (results chart, as of 07/20/2004)

27 CAF STREAM (results chart, as of 07/20/2004)

28 CAF STREAM (results chart, as of 07/20/2004)

29 Experiments Summary
To achieve high performance with CAF, a user or compiler must
- vectorize (and perhaps aggregate) communication
- reduce synchronization strength
  - replace all-to-all with point-to-point where sensible
- overlap communication with computation
- convert GETs into PUTs where GET is not a hardware primitive
- consider memory layout conflicts: co-array vs. regular data
- generate code amenable to back-end compiler optimizations
CAF language: many optimizations are possible at the source level
- compiler optimizations are NECESSARY for a portable coding style
- might need user hints where synchronization analysis falls short
Runtime issues
- register co-array memory for direct transfers where necessary
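The first item, communication vectorization, can be illustrated at the source level; this is a generic sketch, not code from the benchmarks above.

```fortran
! Fine-grained version: one remote GET per element -- n round trips
do i = 1, n
  x(i) = a(i)[p]
end do

! Vectorized version: one bulk GET for the whole section
x(1:n) = a(1:n)[p]
```

Because CAF array syntax expresses the bulk transfer directly, this rewrite is available to the programmer today and to the compiler as an automatic transformation later.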

30 Outline
- Co-array Fortran language overview
- Rice CAF compiler
- A preliminary performance study
- Ongoing research
- Conclusions

31 CAF Language Refinement Issues
- Overly restrictive synchronization primitives
  - add unidirectional, point-to-point synchronization
  - rework model for user-defined teams
- No collective operations
  - leads to home-brew, non-portable implementations
  - add CAF intrinsics for reductions, broadcast, etc.
- Blocking communication reduces scalability
  - user mechanisms to delay completion to enable overlap?
- Synchronization is not paired with data movement
  - synchronization hint tags to help analysis
  - synchronization tags at run time to track completion?
- Relaxing the memory model for high performance
  - enable the programmer to overlap one-sided communication with procedure calls

32 CAF Compiler Research Directions
Aim for performance transparency
- Compiler optimization of communication and I/O
  - optimize communication
    - communication vectorization and aggregation
    - multi-mode communication: direct load/store + RDMA
    - transform from one-sided to two-sided communication
    - transform from GET to PUT communication
  - optimize synchronization
    - synchronization strength reduction
    - exploit split-phase operations for overlap with computation
  - combine synchronization with communication
    - collective communication
    - PUT with flag
  - platform-tailored optimization
- Interoperability with other parallel programming models
- Optimizations to improve node performance

33 Conclusions
- Tuned CAF performance is comparable to tuned MPI
  - even without compiler-based communication optimizations!
- The CAF programming model enables source-level optimization
  - communication vectorization
  - synchronization strength reduction
  - achieve performance today rather than waiting for tomorrow's compilers
- CAF is amenable to compiler analysis and optimization
  - significant communication optimization is feasible, unlike for MPI
  - optimizing compilers will help a wider range of programs achieve high performance
- Applications can be tailored to fully exploit architectural characteristics
  - e.g., shared memory vs. distributed memory vs. hybrid

34 Project URL