1 An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey (Rice University)
Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao (George Washington University)
Daniel Chavarria-Miranda (Pacific Northwest National Laboratory)
2 GAS Languages
Global address space programming model
 – one-sided communication (GET/PUT): simpler than message passing
Programmer has control over performance-critical factors (HPF & OpenMP compilers must get this right)
 – data distribution and locality control (lacking in OpenMP)
 – computation partitioning
 – communication placement
Data movement and synchronization as language primitives
 – amenable to compiler-based communication optimization
3 Questions
Can GAS languages match the performance of hand-tuned message passing programs?
What are the obstacles to obtaining performance with GAS languages?
What should be done to ameliorate them?
 – by language modifications or extensions
 – by compilers
 – by run-time systems
How easy is it to develop high-performance programs in GAS languages?
4 Approach
Evaluate CAF and UPC using the NAS Parallel Benchmarks
Compare performance to that of the MPI versions
 – use hardware performance counters to pinpoint differences
Determine optimization techniques common to both languages as well as language-specific ones
 – language features
 – program implementation strategies
 – compiler optimizations
 – runtime optimizations
Assess programmability of the CAF and UPC variants
5 Outline
Questions and approach
CAF & UPC
 – Features
 – Compilers
 – Performance considerations
Experimental evaluation
Conclusions
6 CAF & UPC Common Features
SPMD programming model
Both private and shared data
Language-level one-sided shared-memory communication
Synchronization intrinsic functions (barrier, fence)
Pointers and dynamic allocation
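A minimal UPC sketch of these common features (private vs. shared data, a one-sided GET, and a barrier); the array name total is illustrative only:

    #include <upc.h>
    #include <stdio.h>

    shared int total[THREADS];  /* shared: element i has affinity to thread i */
    int mine;                   /* private: one copy per thread (SPMD) */

    int main(void) {
        mine = MYTHREAD + 1;
        total[MYTHREAD] = mine;    /* local write to shared data */
        upc_barrier;               /* team synchronization */
        /* one-sided read (GET) of a neighbor's shared element */
        int next = total[(MYTHREAD + 1) % THREADS];
        printf("thread %d read %d\n", MYTHREAD, next);
        return 0;
    }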
7 CAF & UPC Differences I
Multidimensional arrays
 – CAF: multidimensional arrays, procedure argument reshaping
 – UPC: linearization, typically using macros (sketched below)
Local accesses to shared data
 – CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)
 – UPC: shared array reference using MYTHREAD, or a C pointer
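A sketch of the UPC linearization idiom, assuming each thread owns one contiguous N x M tile (the macro name A is hypothetical):

    #include <upc.h>

    #define N 8
    #define M 8

    /* UPC shared arrays are one-dimensional; a blocked layout gives each
       thread one contiguous N*M tile */
    shared [N*M] double a[THREADS * N * M];

    /* linearization macro: element (i,j) of the calling thread's tile */
    #define A(i, j) a[MYTHREAD * N * M + (i) * M + (j)]

    void init_tile(void) {
        /* shared references used here for clarity; fast local access would
           go through a private C pointer (cf. the scalar-performance slide) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                A(i, j) = 0.0;
    }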
8 CAF & UPC Differences II
Scalar/element-wise remote accesses
 – CAF: multidimensional subscripts + bracket syntax
     a(1,1) = a(1,M)[this_image()-1]
 – UPC: shared ("flat") array access with linearized subscripts
     a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]
Bulk and strided remote accesses
 – CAF: use the natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs)
 – UPC: use library functions (and temporary storage to hold a copy)
9 Bulk Communication
CAF:
    integer a(N,M)[*]
    a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
UPC:
    shared int *a;
    upc_memget(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
[Figure: N x M tiles on images/threads P1 ... PN; each copies the last two columns of its left neighbor's tile into its own first two columns]
10 CAF & UPC Differences III
Synchronization
 – CAF: team synchronization
 – UPC: split-phase barrier, locks
UPC: worksharing construct upc_forall (sketched below)
UPC: richer set of pointer types
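A minimal upc_forall sketch (the array name a is illustrative): the fourth clause is an affinity expression, so the thread with affinity to &a[i] executes iteration i:

    #include <upc.h>

    #define N 1024
    shared int a[N];   /* default layout: element i lives on thread i % THREADS */

    int main(void) {
        int i;
        /* worksharing loop: iterations are divided among threads by affinity */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = i;
        upc_barrier;
        return 0;
    }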
11 Outline
Questions and approach
CAF & UPC
 – Features
 – Compilers
 – Performance considerations
Experimental evaluation
Conclusions
12 CAF Compilers
Rice Co-Array Fortran Compiler (cafc)
 – Multi-platform compiler
 – Implements the core of the language
     core sufficient for non-trivial codes
     currently lacks support for derived-type and dynamic co-arrays
 – Source-to-source translator
     translates CAF into Fortran 90 and communication code
     uses ARMCI or GASNet as the communication substrate
     can generate load/store for remote data accesses on SMPs
 – Performance comparable to that of hand-tuned MPI codes
 – Open source
Vendor compilers: Cray
13 UPC Compilers
Berkeley UPC Compiler
 – Multi-platform compiler
 – Implements the full UPC 1.1 specification
 – Source-to-source translator
     converts UPC into ANSI C plus calls to the UPC runtime library & GASNet
     tailors code to a specific architecture: cluster or SMP
 – Open source
Intrepid UPC compiler
 – Based on the GCC compiler
 – Works on SGI Origin, Cray T3E, and Linux SMP
Other vendor compilers: Cray, HP
14 Outline
Questions and approach
CAF & UPC
 – Features
 – Compilers
 – Performance considerations
Experimental evaluation
Conclusions
15 Scalar Performance
Generate code amenable to back-end compiler optimizations
 – Quality of back-end compilers varies
     e.g., poor reduction recognition in the Intel C compiler
Local access to shared data
 – CAF: use F90 pointers and procedure arguments
 – UPC: use C pointers instead of UPC shared pointers
Alias and dependence analysis
 – Fortran vs. C language semantics
     multidimensional arrays in Fortran
     procedure argument reshaping
 – Convey lack of aliasing for (non-aliased) shared variables
     CAF: use procedure splitting so co-arrays are referenced as procedure arguments
     UPC: use the C99 restrict keyword for C pointers used to access shared data (sketched below)
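A sketch of the UPC side of these idioms (names are illustrative; assumes a static THREADS environment so that N/THREADS is a compile-time constant): the locally owned blocks of shared arrays are accessed through plain C pointers, and restrict conveys the lack of aliasing to the back-end compiler:

    #include <upc.h>

    #define N 1024
    /* blocked layout: each thread owns one contiguous chunk of N/THREADS elements */
    shared [N/THREADS] double a[N], b[N];

    void scale_local(double c) {
        /* casting a shared pointer with local affinity to a private pointer is
           legal in UPC; restrict asserts that the two pointers do not alias */
        double * restrict la = (double *)&a[MYTHREAD * (N / THREADS)];
        const double * restrict lb = (const double *)&b[MYTHREAD * (N / THREADS)];
        for (int i = 0; i < N / THREADS; i++)
            la[i] = c * lb[i];
    }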
16 Communication
Communication vectorization is essential for high performance on cluster architectures in both languages
 – CAF: use F90 array sections (the compiler translates them into the appropriate library calls)
 – UPC: use library functions for contiguous transfers; use the UPC extensions for strided transfers in the Berkeley UPC compiler
Increase the efficiency of strided transfers by packing/unpacking data at the language level (sketched below)
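A sketch of language-level packing for a strided transfer (names such as recv_col and send_column are hypothetical): the strided column is first packed into a contiguous private buffer, so a single bulk PUT replaces N element-wise transfers:

    #include <upc.h>

    #define N 64   /* rows */
    #define M 64   /* columns */

    double a[N][M];                          /* private local tile */
    shared [N] double recv_col[THREADS][N];  /* contiguous landing buffer per thread */

    void send_column(int col, int dest) {
        double pack[N];
        /* pack the strided column into a contiguous private buffer ... */
        for (int i = 0; i < N; i++)
            pack[i] = a[i][col];
        /* ... then issue one bulk transfer instead of N element-wise PUTs */
        upc_memput(&recv_col[dest][0], pack, N * sizeof(double));
    }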
17 Synchronization
Barrier-based synchronization
 – can lead to over-synchronized code
Use point-to-point synchronization instead
 – CAF: proposed language extension (sync_notify, sync_wait)
 – UPC: language-level implementation (sketched below)
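One possible language-level implementation of point-to-point synchronization in UPC, mirroring the proposed CAF sync_notify/sync_wait (the flag array name is illustrative):

    #include <upc.h>

    /* one strict flag per thread; 'strict' orders the flag write after the
       data transfers that precede it, so the consumer sees complete data */
    strict shared int ready[THREADS];

    void notify(int peer) {
        ready[peer] = 1;              /* signal peer: data is in place */
    }

    void wait_for_notify(void) {
        while (!ready[MYTHREAD])
            ;                         /* spin on the locally owned flag */
        ready[MYTHREAD] = 0;          /* reset for the next phase */
    }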
18 Outline
Questions and approach
CAF & UPC
Experimental evaluation
Conclusions
19 Platforms and Benchmarks
Platforms
 – Itanium2+Myrinet 2000 (900 MHz Itanium2)
 – Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)
 – SGI Altix 3000 (1.5 GHz Itanium2)
 – SGI Origin 2000 (R10000)
Codes
 – NAS Parallel Benchmarks (NPB 2.3) from NASA Ames: MG, CG, SP, BT
 – CAF and UPC versions were derived from the Fortran77+MPI versions
20 MG class A (256³) on Itanium2+Myrinet 2000
[Performance chart; higher is better]
Intel compiler: restrict yields a 2.3x performance improvement
UPC strided communication is 28% faster than multiple transfers
UPC point-to-point synchronization is 49% faster than barriers
CAF point-to-point synchronization is 35% faster than barriers
21 MG class C (512³) on SGI Altix 3000
[Performance chart up to 64 CPUs; higher is better]
Intel C compiler: scalar performance limits the UPC version
Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts
22 MG class B (256³) on SGI Origin 2000
[Performance chart; higher is better]
23 CG class C (150000) on SGI Altix 3000
[Performance chart; higher is better]
Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!
Point-to-point synchronization is 19% faster than barriers
24 CG class B (75000) on SGI Origin 2000
[Performance chart; higher is better]
Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with the SGI C/Fortran compilers!
25 SP class C (162³) on Itanium2+Myrinet 2000
[Performance chart; higher is better]
restrict yields an 18% performance improvement
26 SP class C (162³) on Alpha+Quadrics
[Performance chart; higher is better]
27 BT class C (162³) on Itanium2+Myrinet 2000
[Performance chart; higher is better]
UPC: use of restrict boosts performance by 43%
CAF: procedure splitting improves performance by 42-60%
UPC: communication packing is 32% faster
CAF: communication packing is 7% faster
28 BT class B (102³) on SGI Altix 3000
[Performance chart; higher is better]
Use of restrict improves performance by 30%
29 Conclusions
Matching MPI performance required using bulk communication
 – library-based bulk primitives are cumbersome in UPC
 – communicating multidimensional array sections is natural in CAF
 – lack of efficient run-time support for strided communication is a problem
With CAF, we can achieve performance comparable to MPI
With UPC, matching MPI performance can be difficult
 – CG: able to match MPI on all platforms
 – SP, BT, MG: a substantial gap remains
30 Why the Gap?
The communication layer is not the problem
 – CAF with either ARMCI or GASNet yields equivalent performance
Scalar optimization of the compiled scientific code is the key!
 – SP+BT: SGI Fortran applies unroll-and-jam and software pipelining (SWP)
 – MG: SGI Fortran applies loop alignment and fusion
 – CG: Intel Fortran generates an optimized sum reduction
Linearized subscripts for multidimensional arrays hurt!
 – we measured a 30% performance gap with Intel Fortran
31 Programming for Performance
In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult
To make codes efficient across the full range of architectures, we need
 – better language support for synchronization
     point-to-point synchronization is an important common case!
 – better CAF & UPC compiler support
     communication vectorization
     synchronization strength reduction
 – better compiler optimization of loops with complex dependence patterns
 – better run-time library support
     efficient communication of strided array sections