
1 An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey (Rice University)
Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao (George Washington University)
Daniel Chavarria-Miranda (Pacific Northwest National Laboratory)

2 GAS Languages
Global address space programming model
–one-sided communication (GET/PUT), simpler than message passing
Programmer has control over performance-critical factors (something HPF & OpenMP compilers must get right)
–data distribution and locality control (lacking in OpenMP)
–computation partitioning
–communication placement
Data movement and synchronization as language primitives
–amenable to compiler-based communication optimization

3 Questions
Can GAS languages match the performance of hand-tuned message-passing programs?
What are the obstacles to obtaining performance with GAS languages?
What should be done to ameliorate them?
–by language modifications or extensions
–by compilers
–by run-time systems
How easy is it to develop high-performance programs in GAS languages?

4 Approach
Evaluate CAF and UPC using the NAS Parallel Benchmarks
Compare performance to that of the MPI versions
–use hardware performance counters to pinpoint differences
Determine optimization techniques common to both languages as well as language-specific ones
–language features
–program implementation strategies
–compiler optimizations
–runtime optimizations
Assess the programmability of the CAF and UPC variants

5 Outline
Questions and approach
CAF & UPC
–Features
–Compilers
–Performance considerations
Experimental evaluation
Conclusions

6 CAF & UPC Common Features
SPMD programming model
Both private and shared data
Language-level one-sided shared-memory communication
Synchronization intrinsic functions (barrier, fence)
Pointers and dynamic allocation

7 CAF & UPC Differences I
Multidimensional arrays
–CAF: multidimensional arrays, procedure argument reshaping
–UPC: linearization, typically using macros (see the sketch below)
Local accesses to shared data
–CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)
–UPC: shared array reference using MYTHREAD or a C pointer
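
A minimal sketch of the UPC linearization idiom, assuming a row-major N x M block per thread; the macro IDX and the array a are illustrative names, not from the original slides:

  /* UPC sketch: emulate a 2-D N x M array per thread on top of a
     1-D shared array by linearizing subscripts with a macro.
     The [N*M] layout qualifier gives each thread one contiguous block. */
  #include <upc.h>
  #define N 64
  #define M 64
  #define IDX(i, j) ((i) * M + (j))        /* row-major linearization */

  shared [N*M] int a[THREADS * N * M];     /* one N x M block per thread */

  int main(void) {
      int base = MYTHREAD * N * M;         /* start of this thread's block */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              a[base + IDX(i, j)] = i + j;
      upc_barrier;
      return 0;
  }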

8 CAF & UPC Differences II
Scalar/element-wise remote accesses
–CAF: multidimensional subscripts + bracket syntax
  a(1,1) = a(1,M)[this_image()-1]
–UPC: shared ("flat") array access with linearized subscripts
  a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]
Bulk and strided remote accesses
–CAF: use the natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs)
–UPC: use library functions (and temporary storage to hold a copy)

9 Bulk Communication
CAF:
  integer a(N,M)[*]
  a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
UPC:
  shared int *a;
  upc_memget(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
[Figure: each of images P1 … PN holds an N x M block; the last two columns of the left neighbor are fetched in one bulk transfer]

10 CAF & UPC Differences III
Synchronization
–CAF: team synchronization
–UPC: split-phase barrier, locks
UPC: worksharing construct upc_forall (see the sketch below)
UPC: richer set of pointer types
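
A minimal sketch of the upc_forall worksharing construct; the array name and loop body are illustrative assumptions:

  /* UPC sketch: upc_forall distributes loop iterations by affinity.
     The fourth clause names a shared address; each iteration runs
     on the thread that owns that address. */
  #include <upc.h>

  shared int v[THREADS * 16];

  int main(void) {
      upc_forall (int i = 0; i < THREADS * 16; i++; &v[i]) {
          v[i] = i;   /* executed only by the thread with affinity to v[i] */
      }
      upc_barrier;
      return 0;
  }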

11 Outline
Questions and approach
CAF & UPC
–Features
–Compilers
–Performance considerations
Experimental evaluation
Conclusions

12 CAF Compilers
Rice Co-Array Fortran Compiler (cafc)
–Multi-platform compiler
–Implements the core of the language
  core sufficient for non-trivial codes
  currently lacks support for derived-type and dynamic co-arrays
–Source-to-source translator
  translates CAF into Fortran 90 plus communication code
  uses ARMCI or GASNet as the communication substrate
  can generate loads/stores for remote data accesses on SMPs
–Performance comparable to that of hand-tuned MPI codes
–Open source
Vendor compilers: Cray

13 UPC Compilers
Berkeley UPC Compiler
–Multi-platform compiler
–Implements the full UPC 1.1 specification
–Source-to-source translator
  converts UPC into ANSI C and calls to the UPC runtime library & GASNet
  tailors code to a specific architecture: cluster or SMP
–Open source
Intrepid UPC compiler
–Based on the GCC compiler
–Works on SGI Origin, Cray T3E and Linux SMPs
Other vendor compilers: Cray, HP

14 Outline
Motivation and Goals
CAF & UPC
–Features
–Compilers
–Performance considerations
Experimental evaluation
Conclusions

15 Scalar Performance
Generate code amenable to back-end compiler optimizations
–Quality of back-end compilers varies
  e.g., poor reduction recognition in the Intel C compiler
Local access to shared data
–CAF: use F90 pointers and procedure arguments
–UPC: use C pointers instead of UPC shared pointers
Alias and dependence analysis
–Fortran vs. C language semantics
  multidimensional arrays in Fortran
  procedure argument reshaping
–Convey lack of aliasing for (non-aliased) shared variables
  CAF: use procedure splitting so co-arrays are referenced as arguments
  UPC: use the C99 restrict keyword for C pointers used to access shared data (see the sketch below)
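
A minimal sketch of the UPC restrict idiom described above, assuming a blocked shared array; the names are illustrative:

  /* UPC sketch: access the local portion of a shared array through a
     restrict-qualified C pointer, so the back-end C compiler can
     assume no aliasing and optimize the loop freely. */
  #include <upc.h>
  #define N 1000

  shared [N] double a[THREADS][N];   /* one block of N elements per thread */

  void scale_local(double s) {
      /* casting away sharedness is legal for data with affinity to
         MYTHREAD; restrict conveys the absence of aliasing */
      double * restrict la = (double *)&a[MYTHREAD][0];
      for (int i = 0; i < N; i++)
          la[i] *= s;
  }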

16 Communication
Communication vectorization is essential for high performance on cluster architectures in both languages
–CAF: use F90 array sections (the compiler translates them into appropriate library calls)
–UPC: use library functions for contiguous transfers
  use UPC extensions for strided transfers in the Berkeley UPC compiler
Increase the efficiency of strided transfers by packing/unpacking data at the language level (see the sketch below)
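
A minimal sketch of language-level packing for a strided transfer, as suggested above; the array shapes, the staging offset, and the function names are illustrative assumptions:

  /* UPC sketch: pack a stride-M column into a contiguous private
     buffer, then ship it with one bulk upc_memput instead of N
     separate element-wise remote stores. */
  #include <upc.h>
  #define N 64
  #define M 64

  shared [N*M] double a[THREADS][N*M];   /* N x M block per thread */
  double buf[N];                          /* private packing buffer */

  void send_column(int col, int dest) {
      double *la = (double *)&a[MYTHREAD][0];  /* this thread's block */
      for (int i = 0; i < N; i++)
          buf[i] = la[i * M + col];            /* pack the column */
      /* one contiguous transfer into a staging area on thread dest */
      upc_memput(&a[dest][N * M - N], buf, N * sizeof(double));
  }

The receiver would unpack the staged column symmetrically; the point is that one large transfer amortizes communication latency that N small transfers would pay repeatedly.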

17 Synchronization
Barrier-based synchronization
–can lead to over-synchronized code
Use point-to-point synchronization instead
–CAF: proposed language extension (sync_notify, sync_wait)
–UPC: language-level implementation (see the sketch below)
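
A minimal sketch of a language-level notify/wait in UPC using per-thread flags, assuming a simple spin-wait protocol; all names are illustrative:

  /* UPC sketch: point-to-point synchronization with strict flags.
     The strict qualifier orders the flag access after preceding
     relaxed shared accesses, so data written before notify() is
     visible to the waiting thread. */
  #include <upc.h>

  strict shared int flag[THREADS];

  void notify(int t) {          /* producer: raise thread t's flag */
      flag[t] = 1;
  }

  void wait_for_notify(void) {  /* consumer: spin until notified */
      while (flag[MYTHREAD] == 0)
          ;
      flag[MYTHREAD] = 0;       /* reset for the next round */
  }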

18 Outline
Questions and approach
CAF & UPC
Experimental evaluation
Conclusions

19 Platforms and Benchmarks
Platforms
–Itanium2+Myrinet 2000 (900 MHz Itanium2)
–Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)
–SGI Altix 3000 (1.5 GHz Itanium2)
–SGI Origin 2000 (R10000)
Codes
–NAS Parallel Benchmarks (NPB 2.3) from NASA Ames: MG, CG, SP, BT
–CAF and UPC versions were derived from the Fortran77+MPI versions

20 MG class A (256³) on Itanium2+Myrinet2000
[Figure: performance chart; higher is better]
–Intel compiler: restrict yields a 2.3× performance improvement
–UPC strided communication 28% faster than multiple transfers
–UPC point-to-point synchronization 49% faster than barriers
–CAF point-to-point synchronization 35% faster than barriers

21 MG class C (512³) on SGI Altix 3000
[Figure: performance chart; higher is better]
–Intel C compiler: scalar performance issues
–Intel Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts

22 MG class B (256³) on SGI Origin 2000
[Figure: performance chart; higher is better]

23 CG class C (150000) on SGI Altix 3000
[Figure: performance chart; higher is better]
–Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!
–point-to-point synchronization 19% faster than barriers

24 CG class B (75000) on SGI Origin 2000
[Figure: performance chart; higher is better]
–Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with the SGI C/Fortran compilers!

25 SP class C (162³) on Itanium2+Myrinet2000
[Figure: performance chart; higher is better]
–restrict yields an 18% performance improvement

26 SP class C (162³) on Alpha+Quadrics
[Figure: performance chart; higher is better]

27 BT class C (162³) on Itanium2+Myrinet2000
[Figure: performance chart; higher is better]
–UPC: use of restrict boosts performance by 43%
–CAF: procedure splitting improves performance by 42-60%
–UPC: communication packing 32% faster
–CAF: communication packing 7% faster

28 BT class B (102³) on SGI Altix 3000
[Figure: performance chart; higher is better]
–use of restrict improves performance by 30%

29 Conclusions
Matching MPI performance required using bulk communication
–library-based primitives are cumbersome in UPC
–communicating multi-dimensional array sections is natural in CAF
–lack of efficient run-time support for strided communication is a problem
With CAF, one can achieve performance comparable to MPI
With UPC, matching MPI performance can be difficult
–CG: able to match MPI on all platforms
–SP, BT, MG: a substantial gap remains

30 Why the Gap?
The communication layer is not the problem
–CAF with ARMCI or GASNet yields equivalent performance
Scalar code optimization of scientific code is the key!
–SP & BT: SGI Fortran applies unroll-and-jam and software pipelining (SWP)
–MG: SGI Fortran applies loop alignment and fusion
–CG: Intel Fortran generates an optimized sum reduction
Linearized subscripts for multidimensional arrays hurt!
–measured a 30% performance gap with Intel Fortran

31 Programming for Performance
In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult
To make codes efficient across the full range of architectures, we need
–better language support for synchronization
  point-to-point synchronization is an important common case!
–better CAF & UPC compiler support
  communication vectorization
  synchronization strength reduction
–better compiler optimization of loops with complex dependence patterns
–better run-time library support
  efficient communication of strided array sections