1 John Mellor-Crummey Cristian Coarfa, Yuri Dotsenko Department of Computer Science Rice University Experiences Building a Multi-platform Compiler for.

Presentation transcript:

1 Experiences Building a Multi-platform Compiler for Co-array Fortran
John Mellor-Crummey, Cristian Coarfa, Yuri Dotsenko
Department of Computer Science, Rice University
AHPCRC PGAS Workshop, September 2005

2 Goals for HPC Languages
Expressiveness
Ease of programming
Portable performance
Ubiquitous availability

3 PGAS Languages
Global address space programming model
  – one-sided communication (GET/PUT) [callout: simpler than msg passing]
Programmer has control over performance-critical factors [callout: HPF & OpenMP compilers must get this right]
  – data distribution and locality control [callout: lacking in OpenMP]
  – computation partitioning
  – communication placement
Data movement and synchronization as language primitives
  – amenable to compiler-based communication optimization

4 Co-array Fortran Programming Model
SPMD process images
  – fixed number of images during execution
  – images operate asynchronously
Both private and shared data
  – real x(20, 20): a private 20x20 array in each image
  – real y(20, 20)[*]: a shared 20x20 array in each image
Simple one-sided shared-memory communication
  – x(:,j:j+2) = y(:,p:p+2)[r]: copy columns from image r into local columns
Synchronization intrinsic functions
  – sync_all: a barrier and a memory fence
  – sync_mem: a memory fence
  – sync_team([team members to notify], [team members to wait for])
Pointers and (perhaps asymmetric) dynamic allocation
Parallel I/O

5 One-sided Communication with Co-Arrays
    integer a(10,20)[*]
    if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[Figure: the array a(10,20) on image 1, image 2, ..., image N]
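The statement on this slide can be mimicked in plain Python to see which elements move. This is only a single-process sketch: the "images" are entries of a list, and the helper names `make_images` and `shift_from_left` are made up for illustration, not part of CAF or cafc.

```python
# Single-process model of the slide's co-array statement:
#   integer a(10,20)[*]
#   if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
# Each "image" is one entry of a Python list; nothing here is real CAF.

ROWS, COLS = 10, 20


def make_images(num_images):
    """Give every image its own 10x20 array; each value encodes image and column."""
    return [
        [[(img + 1) * 1000 + (col + 1) for col in range(COLS)] for _ in range(ROWS)]
        for img in range(num_images)
    ]


def shift_from_left(images):
    """Every image except image 1 reads columns 19:20 of its left neighbour
    and stores them in its own columns 1:2 (Fortran's 1-based numbering)."""
    for me in range(1, len(images)):      # 0-based index; image number is me + 1
        left = images[me - 1]
        for row in range(ROWS):
            # The columns read (19:20) and written (1:2) never overlap,
            # so the order in which images run does not matter.
            images[me][row][0:2] = left[row][18:20]
    return images
```

With three images, image 2 ends up holding image 1's columns 19:20 in its first two columns, while image 1 is unchanged because of the `this_image() > 1` guard.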

6 CAF Compilers
Cray compilers for X1 & T3E architectures
Rice Co-Array Fortran Compiler (cafc)

7 Rice cafc Compiler
Source-to-source compiler
  – source-to-source yields multi-platform portability
Implements core language features
  – core sufficient for non-trivial codes
  – preliminary support for derived types; support for allocatable components coming soon
Open source
Performance comparable to that of hand-tuned MPI codes

8 Implementation Strategy
Goals
  – portability
  – high performance on a wide range of platforms
Approach
  – source-to-source compilation of CAF codes
      use Open64/SL Fortran 90 infrastructure
      CAF → Fortran 90 + communication operations
  – communication
      ARMCI and GASNet one-sided comm libraries for portability
      load/store communication on shared-memory platforms

9 Key Implementation Concerns
Fast access to local co-array data
Fast communication
Overlap of communication and computation

10 Accessing Co-Array Data: Two Representations
SAVE and COMMON co-arrays as Fortran 90 pointers
  – F90 pointers to memory allocated outside the Fortran run-time system
  – original references accessing local co-array data
      rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
  – transformed references
      rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
Procedure co-array arguments as F90 explicit-shape arrays
  – CAF language requires explicit shape for co-array arguments
Example from the slide: a co-array declaration and its pointer-based descriptor
    real :: a(10,10,10)[*]
    type CAFDesc_real_3
      real, pointer :: ptr(:,:,:)  ! F90 pointer to local co-array data
    end type CAFDesc_real_3
    type(CAFDesc_real_3) :: a

11 Performance Challenges
Problem
  – Fortran 90 pointer-based representation does not convey
      the lack of co-array aliasing
      contiguity of co-array data
      co-array bounds information
  – lack of knowledge inhibits important code optimizations
Approach: procedure splitting

12 Procedure Splitting
Original procedure
    subroutine f(…)
      real, save :: c(100)[*]
      ... = c(50) ...
    end subroutine f
After the CAF to CAF optimization
    subroutine f(…)
      real, save :: c(100)[*]
      interface
        subroutine f_inner(…, c_arg)
          real :: c_arg[*]
        end subroutine f_inner
      end interface
      call f_inner(…, c(1))
    end subroutine f
    subroutine f_inner(…, c_arg)
      real :: c_arg(100)[*]
      ... = c_arg(50) ...
    end subroutine f_inner
Benefits
  – better alias analysis
  – contiguity of co-array data
  – co-array bounds information
  – better dependence analysis
Result: back-end compiler can generate better code

13 Implementing Communication
Example: x(1:n) = a(1:n)[p] + …
General approach: use a buffer to hold off-processor data
  – allocate buffer
  – perform GET to fill buffer
  – perform computation: x(1:n) = buffer(1:n) + …
  – deallocate buffer
Optimizations
  – no buffer for co-array to co-array copies
  – unbuffered load/store on shared-memory systems
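The four buffer steps can be sketched as follows. This is a single-process illustration, not the code cafc generates: `get` stands in for a one-sided GET from a communication library, and `local_term` is an assumed stand-in for the elided "+ …" operand on the slide.

```python
def get(remote_coarray, start, stop):
    """Stand-in for a one-sided GET: copy a contiguous remote section."""
    return list(remote_coarray[start:stop])


def add_remote_section(x, local_term, coarrays, p, n):
    """Model of x(1:n) = a(1:n)[p] + local_term(1:n) using a temporary buffer.

    `coarrays` maps an image number to that image's copy of a.
    """
    buffer = get(coarrays[p], 0, n)       # allocate buffer and fill it with a GET
    for i in range(n):                    # the computation touches only local data
        x[i] = buffer[i] + local_term[i]
    del buffer                            # deallocate buffer
    return x
```

For a plain co-array to co-array copy, the first optimization on the slide corresponds to writing the result of `get` straight into the destination instead of going through `buffer`.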

14 Strided vs. Contiguous Transfers
Problem
  – CAF remote reference might induce many small data transfers
      a(i,1:n)[p] = b(j,1:n)
Solution
  – pack strided data on source and unpack it on destination
Constraints
  – can’t express both source-level packing and unpacking for a one-sided transfer
  – two-sided packing/unpacking is awkward for users
Preferred approach
  – have communication layer perform packing/unpacking
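To see why a(i,1:n)[p] = b(j,1:n) is strided, recall that Fortran stores arrays in column-major order, so consecutive elements of a row sit m elements apart in an m x n array. The sketch below models pack, one contiguous message, and unpack on flat Python lists; the function names are illustrative and do not correspond to ARMCI or GASNet calls.

```python
def strided_offsets(i, n, m):
    """0-based offsets of a(i,1:n) in a column-major m x n array (i is 1-based)."""
    return [(i - 1) + col * m for col in range(n)]


def pack(flat, offsets):
    """Gather strided elements into one contiguous message."""
    return [flat[k] for k in offsets]


def unpack(flat, offsets, message):
    """Scatter a contiguous message back to strided locations."""
    for k, value in zip(offsets, message):
        flat[k] = value


def put_row(dest_flat, i, src_flat, j, m, n):
    """Model a(i,1:n)[p] = b(j,1:n) as pack -> one message -> unpack.

    Returns the number of messages sent: 1, versus n element-sized
    transfers if each strided element were sent on its own.
    """
    message = pack(src_flat, strided_offsets(j, n, m))      # source side
    unpack(dest_flat, strided_offsets(i, n, m), message)    # destination side
    return 1
```

The unpack step runs on the destination, which is exactly what a purely one-sided PUT cannot express at source level; hence the slide's preference for doing this inside the communication layer.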

15 Pragmatics of Packing
Who should implement packing?
CAF programmer
  – difficult to program
CAF compiler
  – must convert PUTs into two-sided communication to unpack
      difficult whole-program transformation
Communication library
  – most natural place
  – ARMCI currently performs packing on Myrinet (at least)

16 Synchronization
Original CAF specification: team synchronization only
  – sync_all, sync_team
Limits performance on loosely-coupled architectures
Point-to-point extensions
  – sync_notify(q)
  – sync_wait(p)
Point-to-point synchronization semantics
  – delivery of a notify to q from p ⇒ all communication from p to q issued before the notify has been delivered to q
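The notify/wait semantics can be illustrated with Python threads standing in for images. This is a toy model under stated assumptions: the class `Images` and its methods are invented for the sketch, the PUT completes synchronously, and both sender and receiver are passed explicitly, whereas the slide's sync_notify(q) and sync_wait(p) take the calling image implicitly.

```python
import threading


class Images:
    """Shared state for a handful of simulated images (one thread each)."""

    def __init__(self, num_images, size):
        self.data = [[0] * size for _ in range(num_images)]
        # One counting semaphore per ordered (sender, receiver) pair.
        self._notices = {
            (p, q): threading.Semaphore(0)
            for p in range(num_images)
            for q in range(num_images)
        }

    def put(self, q, values):
        """One-sided PUT into image q's data; complete when this returns."""
        self.data[q][: len(values)] = values

    def sync_notify(self, p, q):
        """Image p tells image q that its earlier PUTs to q are complete."""
        self._notices[(p, q)].release()

    def sync_wait(self, q, p):
        """Image q blocks until a notify from image p has arrived."""
        self._notices[(p, q)].acquire()


def run_pair():
    """Image 0 writes to image 1, then notifies; image 1 waits, then reads."""
    images = Images(num_images=2, size=4)
    seen = []

    def producer():                       # plays image 0
        images.put(1, [1, 2, 3, 4])
        images.sync_notify(0, 1)

    def consumer():                       # plays image 1
        images.sync_wait(1, 0)
        seen.extend(images.data[1])       # safe: the PUT preceded the notify

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Only the two images involved synchronize, which is the point of the extension: no other image has to reach a barrier.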

17 Hiding Communication Latency
Goal: enable communication/computation overlap
Impediments to generating non-blocking communication
  – use of indexed subscripts in co-dimensions
  – lack of whole program analysis
Approach: support hints for non-blocking communication
  – overcome conservative compiler analysis
  – enable sophisticated programmers to achieve good performance today
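The slides do not show the syntax of the hints, so the following is only a generic picture of communication/computation overlap: initiate a transfer, do independent local work, then wait for completion. The function `remote_get` and its artificial latency are assumptions of this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def remote_get(section, latency=0.01):
    """Pretend GET: sleeps to imitate network latency, then returns a copy."""
    time.sleep(latency)
    return list(section)


def overlapped_sum(remote_section, local_values):
    """Start a non-blocking GET, compute on local data, then complete the GET."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(remote_get, remote_section)   # initiate transfer
        local_total = sum(local_values)                    # overlapped local work
        remote_total = sum(handle.result())                # wait for completion
    return local_total + remote_total
```

The overlap is legal only because the local work does not touch the data in flight; establishing that automatically is what the slide says conservative compiler analysis often cannot do.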

18 Questions about PGAS Languages
Performance
  – can performance match hand-tuned msg passing programs?
  – what are the obstacles to top performance?
  – what should be done to overcome them?
      language modifications or extensions?
      program implementation strategies?
      compiler technology?
      run-time system enhancements?
Programmability
  – how easy is it to develop high performance programs?

19 Investigating these Issues
Evaluate CAF, UPC, and MPI versions of NAS benchmarks
Performance
  – compare CAF and UPC performance to that of MPI versions
      use hardware performance counters to pinpoint differences
  – determine optimization techniques common to both languages as well as language-specific optimizations
      language features
      program implementation strategies
      compiler optimizations
      runtime optimizations
Programmability
  – assess programmability of the CAF and UPC variants

20 Platforms and Benchmarks
Platforms
  – Itanium2+Myrinet 2000 (900 MHz Itanium2)
  – Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)
  – SGI Altix 3000 (1.5 GHz Itanium2)
  – SGI Origin 2000 (R10000)
Codes
  – NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
  – MG, CG, SP, BT
  – CAF and UPC versions were derived from Fortran77+MPI versions

21 MG class A (256^3) on Itanium2+Myrinet2000
Intel compiler: restrict yields factor of 2.3 performance improvement
UPC strided comm 28% faster than multiple transfers
UPC point to point 49% faster than barriers
CAF point to point 35% faster than barriers
[Chart: higher is better]

22 MG class C (512^3) on SGI Altix 3000
Intel C compiler: scalar performance
Fortran compiler: linearized array subscripts, 30% slowdown compared to multidimensional subscripts
[Chart: higher is better]

23 MG class B (256^3) on SGI Origin 2000
[Chart: higher is better]

24 CG class C (150000) on SGI Altix 3000
Intel compiler: sum reductions in C 2.6 times slower than Fortran!
Point to point 19% faster than barriers
[Chart: higher is better]

25 CG class B (75000) on SGI Origin 2000
Intrepid compiler (gcc): sum reductions in C are up to 54% slower than SGI C/Fortran!
[Chart: higher is better]

26 SP class C (162^3) on Itanium2+Myrinet2000
restrict yields 18% performance improvement
[Chart: higher is better]

27 SP class C (162^3) on Alpha+Quadrics
[Chart: higher is better]

28 BT class C (162^3) on Itanium2+Myrinet2000
UPC: use of restrict boosts performance by 43%
CAF: procedure splitting improves performance by 42-60%
UPC: comm. packing 32% faster
CAF: comm. packing 7% faster
[Chart: higher is better]

29 BT class B (102^3) on SGI Altix 3000
Use of restrict improves performance by 30%
[Chart: higher is better]

30 Performance Observations
Achieving highest performance can be difficult
  – need effective optimizing compilers for PGAS languages
Communication layer is not the problem
  – CAF with ARMCI or GASNet yields equivalent performance
Scalar code optimization of scientific code is the key!
  – SP+BT: SGI Fortran: unroll+jam, SWP
  – MG: SGI Fortran: loop alignment, fusion
  – CG: Intel Fortran: optimized sum reduction
Linearized subscripts for multidimensional arrays hurt!
  – measured 30% performance gap with Intel Fortran
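For readers unfamiliar with the term, a linearized subscript replaces a multidimensional reference such as u(i,j,k) with a single hand-computed offset into a one-dimensional array. The sketch below shows only that index arithmetic for Fortran's column-major layout; the function name is illustrative, and the slides report the slowdown without showing the generated code.

```python
def linear_index(subscripts, extents):
    """0-based offset of a 1-based Fortran subscript tuple, column-major order.

    For a 3-D array with extents (ni, nj, nk), u(i,j,k) lives at
    (i-1) + (j-1)*ni + (k-1)*ni*nj.
    """
    offset, stride = 0, 1
    for sub, extent in zip(subscripts, extents):
        offset += (sub - 1) * stride
        stride *= extent                  # the next dimension is this much further apart
    return offset
```

With multidimensional subscripts the compiler sees i, j and k separately; after linearization it sees only the combined expression, which is harder to analyze.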

31 Performance Prescriptions
For portable high performance, we need …
Better language support for CAF synchronization
  – point-to-point synchronization is an important common case!
  – currently only a Rice extension outside the CAF standard
Better CAF & UPC compiler support
  – communication vectorization
  – synchronization strength reduction: important for programmability
Compiler optimization of loops with complex dependences
Better run-time library support
  – efficient communication support for strided array sections

32 Programmability Observations
Matching MPI performance required using bulk communication
  – communicating multi-dimensional array sections is natural in CAF
  – library-based primitives are cumbersome in UPC
Strided communication is problematic for performance
  – tedious programming of packing/unpacking at source level
Wavefront computations
  – MPI buffered communication easily decouples sender/receiver
  – PGAS models: buffering explicitly managed by programmer