
Co-array Fortran Performance and Potential: an NPB Experimental Study
Cristian Coarfa, Yuri Dotsenko, Jason Lee Eckhardt, John Mellor-Crummey
Department of Computer Science, Rice University

Parallel Programming Models
Goals: expressiveness, ease of use, performance, portability
Current models:
- OpenMP: difficult to map onto distributed-memory platforms
- HPF: difficult to obtain high performance on a broad range of programs
- MPI: the de facto standard, but hard to program, and assumptions about communication granularity are hard-coded

Co-array Fortran – a Sensible Alternative
- Programming model overview
- Co-array Fortran compiler
- Experiments and discussion
- Conclusions

Co-Array Fortran (CAF)
- Explicitly parallel extension of Fortran 90/95 (Numrich & Reid)
- Global address space SPMD parallel programming model with one-sided communication
- Simple two-level model that supports locality management: local vs. remote memory
- Programmer control over performance-critical decisions: data partitioning and communication
- Suitable for mapping to a range of parallel architectures: shared memory, message passing, hybrid, PIM

CAF Programming Model Features
- SPMD process images: fixed number of images during execution; images operate asynchronously
- Both private and shared data:
    real x(20,20)        ! a private 20x20 array in each image
    real y(20,20)[*]     ! a shared 20x20 array in each image
- Simple one-sided shared-memory communication:
    x(:,j:j+2) = y(:,p:p+2)[r]   ! copy columns p:p+2 of y on image r into local columns j:j+2 of x
- Synchronization intrinsic functions:
    sync_all – a barrier
    sync_team([notify], [wait]) – notify = a vector of process ids to signal; wait = a vector of process ids to wait for, a subset of notify
- Pointers and (perhaps asymmetric) dynamic allocation
- Parallel I/O
A minimal program sketch combining these features follows.
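The sketch below is ours, not from the slides; the program name and section bounds are illustrative. It shows a complete CAF program that declares a private array and a co-array, initializes the local co-array data, and performs a one-sided read from a neighboring image:

    program caf_demo
      implicit none
      real :: x(20,20)         ! private: a separate instance in each image
      real :: y(20,20)[*]      ! co-array: each image's instance is remotely addressable
      integer :: me

      me = this_image()        ! index of this process image
      y = real(me)             ! initialize the local part of the co-array
      call sync_all()          ! barrier: make all writes visible before remote reads

      if (me > 1) then
        x(:,1:2) = y(:,1:2)[me-1]   ! one-sided read of two columns from image me-1
      end if
      call sync_all()
    end program caf_demo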

One-sided Communication with Co-Arrays

    integer a(10,20)[*]
    if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

[Figure: each of images 1, 2, ..., N holds an a(10,20) co-array; an arrow shows the two columns copied from the previous image.]

Finite Element Example (Numrich)

    subroutine assemble(start, prin, ghost, neib, x)
      integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
      real :: x(:)[*]
      call sync_all(neib)
      do p = 1, size(neib)              ! add contributions from neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
      enddo
      call sync_all(neib)
      do p = 1, size(neib)              ! update the neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
      enddo
      call sync_all
    end subroutine assemble

Proposed CAF Model Refinements
Initial implementations on the Cray T3E and X1 led to features not suited for distributed-memory platforms. Key problems and solutions:
- Restrictive memory fence semantics for procedure calls → goal: enable the programmer to overlap one-sided communication with procedure calls
- Overly restrictive synchronization primitives → add unidirectional, point-to-point synchronization (see the sketch below)
- No collective operations → add CAF intrinsics for reductions, broadcast, etc.
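As a hedged illustration of the proposed point-to-point primitives (the buffer name and image numbers are ours), a producer can notify just its consumer instead of synchronizing all images:

    real :: buf(100)[*]
    integer :: me
    me = this_image()

    if (me == 1) then
      buf(:)[2] = buf(:)       ! one-sided put of the local buffer to image 2
      call sync_notify(2)      ! unidirectional signal: the data is on its way
    else if (me == 2) then
      call sync_wait(1)        ! wait only for image 1, not for all images
      ! ... buf now safely holds image 1's data ...
    end if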

Co-array Fortran – A Sensible Alternative
- Programming model overview
- Co-array Fortran compiler
- Experiments and discussion
- Conclusions

Portable CAF Compiler
- Compile CAF to Fortran 90 plus a runtime support library:
  - source-to-source code generation for wide portability
  - expect the best performance by leveraging the vendor F90 compiler
- Co-arrays:
  - represent co-array data using F90 pointers
  - allocate storage, with dope-vector initialization outside F90
  - access co-array data through pointer dereference (see the sketch below)
- Porting to a new compiler / architecture:
  - synthesize compatible dope vectors for co-array storage: CHASM library (LANL)
  - tailor communication to the architecture: today ARMCI (PNNL); in the future, a compiler-tailored communication library
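For intuition, here is a simplified sketch of the kind of code such a translation might produce; it is not the compiler's actual output, and the name caf_y is illustrative:

    ! CAF source:
    !     real :: y(20,20)[*]
    !     y(i,j) = 2.0
    !
    ! Generated Fortran 90 (sketch):
    real, pointer :: caf_y(:,:)   ! dope vector initialized outside F90 by the
                                  ! runtime to point at registered co-array storage
    ! ... runtime associates caf_y with this image's storage for y ...
    caf_y(i,j) = 2.0              ! local access becomes a pointer dereference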

CAF Compiler Status
- Near production-quality F90 front end from Open64
- Working prototype for CAF core features:
  - co-array allocation using a static constructor-like strategy
  - co-array access: remote data via ARMCI get/put, process-local data via load/store
  - co-array parameter passing
  - sync_all, sync_team, sync_wait, sync_notify synchronization
  - multi-dimensional array section operations
- Co-array communication is inserted around statements with co-array accesses
- Currently no communication optimizations

Co-array Fortran – A Sensible Alternative
- Programming model overview
- Co-array Fortran compiler
- Experiments and Discussion
- Conclusions

Platform
- 96 HP zx6000 workstations: dual 900 MHz Itanium2, 32KB/256KB/1.5MB L1/L2/L3 caches, 4GB RAM
- Myrinet 2000 interconnect
- Red Hat Linux (kernel plus patches)
- Intel Fortran Compiler v7.1, ARMCI 1.1-beta, MPICH-GM

NAS Benchmarks 2.3
- Benchmarks by NASA:
  - regular, dense-matrix codes: MG, BT, SP
  - irregular codes: CG
- 2-3K lines each (Fortran 77)
- Widely used to test parallel compiler performance
- Versions used:
  - NPB2.3b2: hand-coded MPI
  - NPB2.3-serial: serial code extracted from the MPI version
  - NPB2.3-CAF: CAF implementation, based on the MPI version

Lessons
- Communication aggregation and vectorization were crucial for high performance (see the sketch below)
- The memory layout of communication buffers, co-arrays, and COMMON/SAVE arrays may require thorough analysis and optimization
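A hedged sketch of communication vectorization (the array names and bounds are ours, not from the benchmarks): a loop of fine-grained remote element accesses becomes a single array-section copy, replacing many small messages with one large one.

    real :: a(1000)[*], b(1000)
    integer :: i, n, p
    n = 1000; p = 1                 ! p: the image we read from (illustrative)

    ! Fine-grained: one small remote access per iteration
    do i = 1, n
      b(i) = a(i)[p]
    end do

    ! Vectorized: a single bulk array-section copy
    b(1:n) = a(1:n)[p]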

Lessons
- Replacing barriers with point-to-point synchronization boosted performance dramatically
- Converting gets into puts also improved performance (see the sketch below)
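A hedged sketch of the get-to-put conversion (names are illustrative): instead of each image pulling data from its neighbor, the producer pushes it, which helps on interconnects where put, not get, is the native primitive.

    real :: edge(100)[*]      ! boundary data, remotely readable
    real :: halo(100)[*]      ! receive buffer, remotely writable
    integer :: me
    me = this_image()

    ! Get version: each image pulls its left neighbor's boundary data
    if (me > 1) halo(:) = edge(:)[me-1]

    ! Put version: each image pushes its boundary data to its right neighbor
    ! (the target still needs to learn of the arrival, e.g. via sync_notify)
    if (me < num_images()) halo(:)[me+1] = edge(:)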

Lesson
Tradeoff between buffer size and the amount of synchronization necessary: more buffer storage leads to less synchronization (see the multi-buffering sketch below).
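One way to realize this tradeoff is multi-buffering; the two-buffer scheme below is our sketch, not taken from the benchmarks. With nbuf receive slots, the producer can run ahead and needs to synchronize with the consumer only once every nbuf steps rather than after every transfer.

    integer, parameter :: n = 100, nbuf = 2
    real :: recvbuf(n, nbuf)[*]    ! nbuf communication buffers per image
    real :: mydata(n)
    integer :: me, step, slot

    me = this_image()
    do step = 1, 10
      slot = mod(step, nbuf) + 1   ! alternate between buffer slots
      if (me < num_images()) then
        recvbuf(:, slot)[me+1] = mydata(:)   ! fill the slot not being read
        call sync_notify(me+1)               ! tell the consumer this slot is ready
      end if
    end do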

Lesson
Communication/computation overlap is important for performance, but it is not possible with the current CAF memory fence semantics.

Experiments Summary
On cluster-based architectures, to achieve the best performance with CAF, a user or compiler must:
- vectorize (and perhaps aggregate) communication
- reduce synchronization strength: replace all-to-all with point-to-point synchronization where sensible
- convert gets into puts where get is not a hardware primitive
- consider memory layout conflicts: co-array vs. regular data
- overlap communication with computation
The CAF language enables many of these optimizations to be performed manually at the source level. We plan to automate them, but the compiler might need user hints.

Conclusions
- CAF performance is comparable to highly tuned hand-coded MPI, even without compiler-based communication optimizations!
- The CAF programming model enables optimization at the source level (communication vectorization, synchronization strength reduction) → achieve performance today rather than waiting for tomorrow's compilers
- CAF is amenable to compiler analysis and optimization:
  - significant communication optimization is feasible, unlike for MPI
  - optimizing compilers will help a wider range of programs achieve high performance
  - applications can be tailored to fully exploit architectural characteristics (e.g., shared memory vs. distributed memory vs. hybrid)