An Emerging, Portable Co-Array Fortran Compiler for High-Performance Computing
Daniel Chavarría-Miranda, Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
{danich, ccristi, dotsenko, johnmc}@cs.rice.edu, Rice University

Programming Models for High-Performance Computing

MPI:
- Portable and widely used.
- The programmer has explicit control over data locality and communication.
- Using MPI can be difficult and error prone; most of the burden for communication optimization falls on application developers, and compiler support is underutilized.

HPF:
- The compiler is responsible for communication and data locality.
- The programmer annotates sequential code (semiautomatic parallelization), which requires heroic compiler technology.
- The model limits the application paradigms: extensions to the standard are required to support irregular computation.

Co-Array Fortran: a sensible alternative to these extremes
- A simple, expressive model for high-performance programming, based on extensions to a widely used language.
- Performance: users control data and computation partitioning.
- Portability: the same language serves SMPs, MPPs, and clusters.
- Programmability: a global address space keeps programs simple.

Co-Array Fortran Language

- SPMD process images: the number of images is fixed during execution, and images operate asynchronously.
- Both private and shared data:
      real a(20,20)        ! private: a 20x20 array in each image
      real a(20,20)[*]     ! shared: a 20x20 co-array in each image
- Simple one-sided shared-memory communication:
      x(:,j:j+2) = a(r,:)[p:p+2]   ! copy row r of images p:p+2 into local columns
- Flexible synchronization: sync_team(team [,wait]), where team is a vector of process ids to synchronize with and wait is a vector of processes to wait for (a subset of team).
- Pointers and dynamic allocation.
- Parallel I/O.

A small end-to-end sketch combining these features follows.
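The fragment below is a minimal, self-contained sketch (not from the original poster) that ties these features together: each image fills its block of a shared co-array, then performs a one-sided get from a ring neighbor into a private array. The array extents and the ring-neighbor pattern are illustrative assumptions.

    program caf_features
      ! shared: every image owns a 4x4 block of a, and any image can
      ! reference another image's block with [..] subscripts
      real, save :: a(4,4)[*]
      real       :: x(4,4)    ! private: each image has its own copy
      integer    :: me, right

      me = this_image()                      ! my image number (1-based)
      right = me + 1                         ! read from the right neighbor,
      if (right > num_images()) right = 1    ! wrapping around the ring

      a = real(me)                 ! each image initializes its own block
      call sync_all()              ! make all blocks globally visible

      ! one-sided get: pull row 1 of the neighbor's block into a private
      ! column; image 'right' takes no explicit action
      x(:,1) = a(1,:)[right]

      call sync_all()              ! do not exit while gets are in flight
    end program caf_features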
Explicit Data and Computation Partitioning

[Figure: the declaration integer A(10,10)[*] gives each image (image 0 through image N) its own 10x10 block of A; a co-array section reference such as A(1:10,1:10)[2] names the block that lives on image 2, so plain assignments move whole blocks between images.]

Co-Array Fortran enables simple expression of complicated communication patterns.

Finite Element Example

    subroutine assemble(start, prin, ghost, neib, x)
      integer :: start(:), prin(:), ghost(:), neib(:)
      integer :: k1, k2, p
      real :: x(:)[*]
      call sync_all(neib)
      do p = 1, size(neib)                 ! update from ghost regions
        k1 = start(p); k2 = start(p+1)-1
        x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
      enddo
      do p = 1, size(neib)                 ! update the ghost regions
        k1 = start(p); k2 = start(p+1)-1
        x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
      enddo
      call sync_all
    end subroutine assemble

Research Focus

- Compiler-directed optimization of communication, tailored to the target platform's communication fabric.
- Transformation, where useful, from one-sided to 1.5-sided, two-sided, and collective communication (see the sketch after this list).
- Generation of both fine-grain loads/stores and calls to communication libraries, as appropriate.
- Multi-model code for hierarchical architectures.
- Platform-driven optimization of computation.
- Compiler-directed parallel I/O (with UIUC).
- Enhancements to the Co-Array Fortran synchronization model.
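As an illustration of what transforming one-sided accesses into collective communication can mean, consider the sum reduction in the next section: instead of image 1 issuing one get per image, every process can contribute its block to a single reduction. The sketch below expresses that collective target in Fortran with standard MPI calls; it illustrates the optimization goal and is not output of the Rice compiler.

    program sum_collective
      use mpi
      integer :: caf2d(10,10), sum2d(10,10)
      integer :: me, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)

      caf2d = me + 1       ! rank 0 plays the role of image 1

      ! element-wise sum of every process's 10x10 block, delivered to
      ! rank 0 in one collective call instead of num_imgs separate gets
      call MPI_REDUCE(caf2d, sum2d, 100, MPI_INTEGER, MPI_SUM, 0, &
                      MPI_COMM_WORLD, ierr)

      if (me == 0) write(*,*) 'sum2d = ', sum2d
      call MPI_FINALIZE(ierr)
    end program sum_collective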
Sum Reduction Example

Original Co-Array Fortran program:

    program eCafSum
      integer, save :: caf2d(10, 10)[*]
      integer :: sum2d(10, 10)
      integer :: me, num_imgs, i

      me = this_image()         ! what is my image number?
      num_imgs = num_images()   ! how many images are running?

      caf2d(1:10, 1:10) = me    ! initial data assignment
      call sync_all()

      ! compute the sum of the 2d co-array over all images
      if (me .eq. 1) then
        sum2d(1:10, 1:10) = 0
        do i = 1, num_imgs
          sum2d(1:10, 1:10) = sum2d(1:10, 1:10) + caf2d(1:10, 1:10)[i]
        end do
        write(*,*) 'sum2d = ', sum2d
      endif
    end program eCafSum

Resulting Fortran 90 parallel program:

    program eCafSum
      < Co-array Fortran initialization >
      ecafsum_caf2d%ptr(1:10, 1:10) = me
      call CafArmciSynchAll()
      if (me .eq. 1) then
        sum2d(1:10, 1:10) = 0
        do i = 1, num_imgs, 1
          allocate( cafTemp_2%ptr(1:10, 1:10) )
          cafTemp_4%ptr => ecafsum_caf2d%ptr(1:10, 1:10)
          call CafArmciGetS(ecafsum_caf2d%handle, i, cafTemp_4, cafTemp_2)
          sum2d(1:10, 1:10) = cafTemp_2%ptr(1:10, 1:10) + sum2d(1:10, 1:10)
          deallocate( cafTemp_2%ptr )
        end do
        write(*,*) 'sum2d = ', sum2d(1:10, 1:10)
      endif
      call CafArmciFinalize()
    end program eCafSum

Current Implementation Status

- Source-to-source code generation for wide portability; the open-source compiler will be made available.
- Working prototype for a subset of the language.
- The initial implementation performs no optimization: each co-array access is transformed into a get or put operation at the same point in the code (a hedged sketch of the put direction follows below).
- Code generation targets the widely portable ARMCI communication library.
- The front end is based on the production-quality Open64 front end, modified to support source-to-source compilation.
- Successfully compiled and executed NAS MG on an SGI Origin, with performance similar to the hand-coded MPI version.
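For symmetry with the generated get in the Sum Reduction Example above, here is a minimal sketch of what the same unoptimized translation scheme might produce for a put, i.e., an assignment to a remote co-array. The poster shows only the get path; the routine name CafArmciPutS and the temporary names are assumptions made by analogy with the CafArmciGetS call shown above.

    ! Co-Array Fortran source (hypothetical put):
    !     caf2d(1:10, 1:10)[i] = sum2d(1:10, 1:10)
    !
    ! Plausible generated Fortran 90, by analogy with the get
    ! translation above; CafArmciPutS is an assumed name:
    allocate( cafTemp_1%ptr(1:10, 1:10) )
    cafTemp_1%ptr = sum2d(1:10, 1:10)               ! stage the local data
    cafTemp_3%ptr => ecafsum_caf2d%ptr(1:10, 1:10)  ! describe the remote target
    call CafArmciPutS(ecafsum_caf2d%handle, i, cafTemp_3, cafTemp_1)
    deallocate( cafTemp_1%ptr )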