Experiences with Co-array Fortran on Hardware Shared Memory Platforms
Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey, Daniel Chavarria-Miranda
Rice University, Houston, TX

Co-array Fortran
- Global Address Space (GAS) language
- SPMD programming model
- Simple extension of Fortran 90
- Explicit control over data placement and computation distribution
- Private data
- Shared data: both local and remote
- One-sided communication (PUT and GET)
- Team and point-to-point synchronization
(see the minimal sketch below)
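A minimal sketch, not from the talk, assuming a ring exchange among images; the program name and variables are invented, but the co-array declaration, one-sided PUT, and sync_all call follow the CAF usage shown on the following slides:

    program ring_put
      integer, save :: left_id[*]      ! scalar co-array: one copy per image
      integer :: me, n, right
      me = this_image()
      n  = num_images()
      right = mod(me, n) + 1
      left_id[right] = me              ! one-sided PUT into the right neighbor's copy
      call sync_all()                  ! team synchronization
      print *, 'image', me, 'received', left_id
    end program ring_put

Each image writes its own index into its right neighbor's copy of left_id, so after the barrier every image holds the index of its left neighbor.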

Co-array Fortran: Example

    integer :: a(10,20)[*]

    if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

[Figure: the array a(10,20) on images 1, 2, ..., N; each image copies columns 19:20 from its left neighbor into its own columns 1:2.]

Compiling CAF
- Source-to-source translation
- Prototype: Rice cafc
  - Fortran 90 pointer-based co-array representation
  - ARMCI-based data movement
- Goal: performance transparency
- Challenges:
  - Retain CAF source-level information: array contiguity, array bounds, lack of aliasing
  - Exploit efficient fine-grain communication on SMPs

Outline
- Co-array representation and data access
  - Local data
  - Remote data
- Experimental evaluation
- Conclusions

Representation and Access for Local Data
Efficient local access to SAVE/COMMON co-arrays is crucial to achieving the best performance on a target architecture. Candidate representations:
- Fortran 90 pointer
- Fortran 90 pointer to structure
- Cray pointer
- Subroutine argument
- COMMON block (needs OS support for symmetric shared objects)

Fortran 90 Pointer Representation

CAF declaration:
    real, save :: a(10,20)[*]

After translation:
    type T1
      integer(PtrSize) :: handle
      real, pointer    :: local(:,:)
    end type T1
    type (T1) :: ca

Local access: ca%local(2,3)

- Portable representation
- The back-end compiler has no knowledge about:
  - potential aliasing (no-alias flags exist for some compilers)
  - contiguity
  - bounds
- Implemented in cafc
(an illustrative aliasing example follows)
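To make the aliasing point concrete, here is a small illustrative module; it is not cafc output, and the names are invented. With the pointer representation the back-end compiler must assume the pointer targets may overlap, whereas explicit-shape dummy arguments let it assume no aliasing under Fortran's argument rules:

    module rep_demo
      type T1
        real, pointer :: local(:,:)
      end type T1
    contains
      subroutine add_ptr(ca, cb)
        type(T1) :: ca, cb
        ! Pointer representation: the compiler must assume ca%local and
        ! cb%local may overlap, which inhibits vectorization.
        ca%local = ca%local + cb%local
      end subroutine add_ptr

      subroutine add_arg(a, b)
        real :: a(10,20), b(10,20)
        ! Explicit-shape dummies: Fortran aliasing rules let the compiler
        ! assume a and b do not overlap, so the assignment vectorizes freely.
        a = a + b
      end subroutine add_arg
    end module rep_demo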

Fortran 90 Pointer to Structure Representation

CAF declaration:
    real, save :: a(10,20)[*]

After translation:
    type T1
      real :: local(10,20)
    end type T1
    type (T1), pointer :: ca

- Conveys constant bounds and contiguity
- Potential aliasing is still a problem

Cray Pointer Representation

CAF declaration:
    real, save :: a(10,20)[*]

After translation:
    real :: a_local(10,20)
    pointer (a_ptr, a_local)

- Conveys constant bounds and contiguity
- Potential aliasing is still a problem
- The Cray pointer is not in the Fortran 90 standard

Subroutine Argument Representation

CAF source:
    subroutine foo(…)
      real, save :: a(10,20)[*]
      a(i,j) = … + a(i-1,j) * …
    end subroutine foo

After translation:
    subroutine foo(…)
      ! F90 representation for co-array a
      call foo_body(ca%local(1,1), ca%handle, …)
    end subroutine foo

    subroutine foo_body(a_local, a_handle, …)
      real :: a_local(10,20)
      a_local(i,j) = … + a_local(i-1,j) * …
    end subroutine foo_body

Subroutine Argument Representation (cont.)
- Avoids conservative assumptions about co-array aliasing by the back-end compiler
- Performance is close to optimal
- Costs extra procedures and procedure calls
- Implemented in cafc

COMMON Block Representation

CAF declaration:
    real :: a(10,20)[*]
    common /a_cb/ a

After translation:
    real :: ca(10,20)
    common /ca_cb/ ca

- Yields the best performance for local accesses
- The OS must support symmetric data objects

Outline
- Co-array representation and data access
  - Local data
  - Remote data
- Experimental evaluation
- Conclusions

Generating CAF Communication
- Generic parallel architectures: library function calls to move data
- Shared memory architectures (load/store):
  - Fortran 90 pointers
  - Vector of Fortran 90 pointers
  - Cray pointers

Communication Generation for Generic Parallel Architectures

CAF code:
    a(:) = b(:)[p] + …

Translated code:
    allocate b_temp(:)
    call GET( b, p, b_temp, … )
    a(:) = b_temp(:) + …
    deallocate b_temp

- Portable: works on clusters and SMPs
- Function overhead per fine-grain access
- Uses a temporary to hold off-processor data
- Implemented in cafc
(see the fine-grain expansion sketched below)
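The per-access cost is easiest to see for a single-element reference. A hypothetical expansion in the same pseudocode style as the slide above (GET and its argument list are placeholders, not the actual cafc runtime interface):

    ! x = b(i)[p] + 1.0   becomes roughly:
    allocate (b_temp(1))
    call GET( b, p, b_temp, i, i )   ! fetch one element from image p
    x = b_temp(1) + 1.0
    deallocate (b_temp)
    ! one runtime call, one allocation, and one copy per element accessed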

Communication Generation Using Fortran 90 Pointers

CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do

Translated code:
    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA, p, A_handle)
      C(j) = ptrA
    end do

- Function call overhead for each reference
- Implemented in cafc

Pointer Initialization Hoisting

Naïvely translated code:
    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA, p, A_handle)
      C(j) = ptrA
    end do

Code with hoisted pointer initialization:
    ptrA => A(1:N)
    call CafSetPtr(ptrA, p, A_handle)
    do j = 1, N
      C(j) = ptrA(j)
    end do

Pointer initialization hoisting is not yet implemented in cafc.

Communication Generation Using a Vector of Fortran 90 Pointers

CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do

Translated code:
    … initialization …
    do j = 1, N
      C(j) = ptrVectorA(p)%ptrA(j)
    end do

- Does not require pointer initialization hoisting and avoids per-access function calls
- Performance is worse than that of hoisted pointer initialization
(a sketch of the elided initialization follows)
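The elided one-time initialization could look roughly like the following; this is a sketch under the conventions of the previous slides, and ptrVectorA together with its setup loop is an assumption, reusing the CafSetPtr call from the Fortran 90 pointer strategy:

    ! Hypothetical setup, executed once per co-array rather than once per access:
    do q = 1, num_images()
      ptrVectorA(q)%ptrA => A(1:N)                     ! establish shape and bounds
      call CafSetPtr(ptrVectorA(q)%ptrA, q, A_handle)  ! remap to image q's copy of A
    end do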

Communication Generation Using Cray Pointers

CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do

Translated code:
    integer(PtrSize) :: addrA(:)
    … addrA initialization …
    do j = 1, N
      ptrA = addrA(p)
      C(j) = A_rem(j)
    end do

- addrA(p) is the address of co-array A on image p
- Cray pointer initialization hoisting yields only marginal improvement
(the Cray pointer binding is sketched below)
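For completeness, the Cray pointer binding between ptrA and A_rem, which the slide elides, would presumably be declared along these lines; the pointee name A_rem follows the slide, the rest is an assumption:

    real :: A_rem(N)          ! pointee: the remote image's co-array A, viewed locally
    pointer (ptrA, A_rem)     ! Cray pointer ptrA controls where A_rem points
    ! After ptrA = addrA(p), references to A_rem(j) are direct loads from image p's memory.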

Outline
- Co-array representation and data access
  - Local data
  - Remote data
- Experimental evaluation
- Conclusions

Experimental Platforms
- SGI Altix 3000: Itanium2 1.5 GHz processors with 6 MB L3 cache; Linux; Intel Fortran Compiler 8.0
- SGI Origin 2000: MIPS R-series processors with 8 MB L2 cache; IRIX; MIPSpro Compiler

Benchmarks
- STREAM
- Random Access
- Spark98
- NAS MG and SP

STREAM

Copy kernel (local and remote variants):
    DO J = 1, N
      C(J) = A(J)
      C(J) = A(J)[p]
    END DO

Triad kernel (local and remote variants):
    DO J = 1, N
      A(J) = B(J) + s*C(J)
      A(J) = B(J)[p] + s*C(J)[p]
    END DO

Goal: investigate how well the architecture's memory bandwidth can be delivered up to the language level.

STREAM: Local Accesses
- The COMMON block representation is best, when the platform allows it
- The subroutine argument representation performs similarly to the COMMON block representation
- Pointer-based representations perform within 5% of the best on the Altix (with the no-aliasing flag) and within 15% on the Origin
- Without the flag asserting lack of pointer aliasing, the Fortran 90 pointer representation yields only 30% of the best performance on the Altix
- Array section statements with the Fortran 90 pointer representation yield 40-50% of the best performance on the Origin

STREAM: Remote Accesses
- The COMMON block representation for local accesses plus Cray pointers for remote accesses is best
- The subroutine argument representation plus Cray pointers for remote accesses performs similarly
- Remote accesses that incur a function call per access perform very poorly: 24 times slower than the best on the Altix and five times slower on the Origin
- The generic strategy (with intermediate temporaries) delivers only 50-60% of the best performance on the Altix and 30-40% on the Origin for vectorized code (except for the Copy kernel)
- Pointer initialization hoisting is crucial for remote accesses through Fortran 90 pointers and desirable for Cray pointers
- A similarly coded OpenMP version achieves comparable performance on the Altix (90% for the Scale kernel) and 86-90% on the Origin

Spark98
- Based on CMU's earthquake simulation code
- Computes a sparse matrix-vector product
- Irregular application with fine-grain accesses
- Matrix distribution and computation partitioning are done offline (sf2 traces)
- Spark98 computes partial products locally, then assembles the result across processors (a generic sketch of the local kernel follows)
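For readers unfamiliar with the kernel, a generic compressed-sparse-row matrix-vector product looks roughly like the following; this is an illustration only, since Spark98 uses its own packed symmetric format and sf2 partitioning, and the array names here are invented:

    subroutine csr_matvec(nrows, rowptr, colidx, val, x, v)
      integer, intent(in)  :: nrows, rowptr(nrows+1), colidx(*)
      real,    intent(in)  :: val(*), x(*)
      real,    intent(out) :: v(nrows)
      integer :: i, k
      do i = 1, nrows
        v(i) = 0.0
        do k = rowptr(i), rowptr(i+1) - 1
          v(i) = v(i) + val(k) * x(colidx(k))   ! accumulate row i
        end do
      end do
    end subroutine csr_matvec

In the parallel versions each processor applies a local kernel of this kind to its partition and then exchanges the shared entries of v, which is the assembly step shown on the "Spark98 GETs Result Assembly" slide below.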

Spark98 (cont.)
Versions:
- Serial (Fortran kernel, ported from C)
- MPI (Fortran kernel, ported from C)
- Hybrid (best shared-memory threaded version)
- CAF versions (based on the MPI version):
  - CAF Packed PUTs
  - CAF Packed GETs
  - CAF GETs (computation with remote data accessed "in place")

Spark98 GETs Result Assembly

    v2(:,:) = v(:,:)
    call sync_all()
    do s = 0, subdomains-1
      if (commindex(s) < commindex(s+1)) then
        pos = commindex(s)
        comm_len = commindex(s+1) - pos
        v(:, comm(pos:pos+comm_len-1)) =          &
            v(:, comm(pos:pos+comm_len-1)) +      &
            v2(:, comm_gets(pos:pos+comm_len-1))[s]
      end if
    end do
    call sync_all()

Spark98 Performance on Altix
- The performance of all CAF versions is comparable to that of MPI, and better at large CPU counts
- CAF GETs is simpler and more "natural" to code, but up to 13% slower
- Without attention to locality, applications do not scale on NUMA architectures (as the Hybrid version shows)
- The ARMCI library is more efficient than MPI

NAS MG and SP
Versions:
- MPI (NPB 2.3)
- CAF (based on MPI NPB 2.3)
  - Generic code generation with the subroutine-argument co-array representation (procedure splitting)
  - Shared-memory code generation (Fortran 90 pointers; vectorized source code) with the subroutine-argument co-array representation
- OpenMP (NPB 3.0)
Problem size: Class C

NAS SP Performance on Altix
- Performance of the CAF versions is comparable to that of MPI
- CAF-generic outperforms CAF-shm because it uses memcpy, which hides latency by keeping an optimal number of memory operations in flight
- OpenMP scales poorly

NAS MG Performance on Altix

Conclusions
- Direct load/store communication improves the performance of fine-grain accesses by a factor of 24 on the Altix 3000 and a factor of five on the Origin 2000
- "In-place" use of remote data in CAF statements incurs acceptable abstraction overhead
- Performance is comparable to that of MPI codes for both fine-grain and coarse-grain applications
- We plan to implement optimal, architecture-dependent code generation for local and remote co-array accesses in cafc