Co-Array Fortran: Open-Source Compilers and Tools for Scalable Global Address Space Computing
John Mellor-Crummey
Rice University

Outline
- Co-array Fortran
  - language overview
  - CAF compiler status and preliminary results
  - language and compiler research issues
  - interactions
- OpenMP
  - compiler and runtime strategies for improving scalability
  - Dragon tool
  - hybrid MPI + OpenMP
- Open64 infrastructure
  - source-to-source and source-to-object code infrastructure

Co-Array Fortran (CAF)
- Explicitly parallel extension of Fortran 90/95 (Numrich & Reid)
- Global address space SPMD parallel programming model
  - one-sided communication
- Simple, two-level model that supports locality management
  - local vs. remote memory
- Programmer control over performance-critical decisions
  - data partitioning
  - communication
- Suitable for mapping to a range of parallel architectures
  - shared memory, message passing, hybrid, PIM
- Much in common with UPC

CAF Programming Model Features
- SPMD process images
  - fixed number of images during execution
  - images operate asynchronously
- Both private and shared data
  - real y(20, 20)        a private 20x20 array in each image
  - real y(20, 20)[*]     a shared 20x20 array in each image
- Simple one-sided shared-memory communication
  - x(:, j:j+2) = y(r, :)[p:p+2]    copy row r of y from images p:p+2 into local columns j:j+2 of x
- Flexible synchronization
  - sync_team(notify [, wait])
    - notify = a vector of process ids to signal
    - wait = a vector of process ids to wait for
- Pointers and (perhaps asymmetric) dynamic allocation
- Parallel I/O
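
A minimal sketch that puts these features together, in the dialect used on these slides (sync_all, this_image, num_images); the array sizes and the neighbor copy are illustrative only:

  program features_demo
    integer :: me, n
    real :: x(20, 20)        ! private: an independent copy in each image
    real :: y(20, 20)[*]     ! co-array: other images' copies reachable via [...]

    me = this_image()        ! index of this process image
    n  = num_images()        ! number of images, fixed for the run

    y = real(me)             ! each image initializes its own copy
    call sync_all()          ! make sure every image has written y

    ! one-sided read: pull row 1 from the next image; the owner of the
    ! data takes no action
    if (me < n) x(1, :) = y(1, :)[me + 1]

    call sync_all()
  end program features_demo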

One-sided Communication with Co-Arrays

  integer a(10,20)[*]
  if (this_image() > 1) a(1:5,1:10) = a(1:5,1:10)[this_image()-1]

[Figure: copies of a(10,20) on images 1, 2, ..., N; every image except the first copies the section a(1:5,1:10) from the image to its left.]
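
The same shift can also be written as a one-sided put by the data's owner; a hedged sketch with the synchronization the slide elides (the buffer co-array b is introduced here to keep the puts race-free):

  integer :: a(10,20)[*], b(5,10)[*]

  a = this_image()                       ! each image fills its own copy
  call sync_all()                        ! all copies of a are ready
  if (this_image() < num_images()) &     ! push my section to my right neighbor
      b(:,:)[this_image()+1] = a(1:5,1:10)
  call sync_all()                        ! puts complete before b is used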

Finite Element Example (Numrich)

  subroutine assemble(start, prin, ghost, neib, x)
    integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
    real :: x(:)[*]
    call sync_all(neib)
    do p = 1, size(neib)    ! add contributions from ghost regions
      k1 = start(p); k2 = start(p+1) - 1
      x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
    enddo
    call sync_all(neib)
    do p = 1, size(neib)    ! update the ghosts
      k1 = start(p); k2 = start(p+1) - 1
      x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
    enddo
    call sync_all()
  end subroutine assemble

Portable CAF Compiler
- Compile CAF to Fortran 90 + runtime support library
  - source-to-source code generation for wide portability
  - expect best performance by leveraging vendor F90 compiler
- Co-arrays
  - access data in generated code using F90 pointers (sketched below)
  - allocate storage, with dope vector initialization outside F90
- Porting to a new compiler / architecture
  - synthesize compatible dope vectors for co-array storage
  - tailor communication to architecture
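
A runnable toy illustrating the code-generation idea: the program touches co-array data only through an F90 pointer whose dope vector is initialized by a separate "runtime" routine. The module and routine names here are invented, and plain allocation stands in for storage actually set up outside Fortran:

  module fake_caf_runtime
    implicit none
    real, allocatable, target :: storage(:,:)   ! stands in for memory allocated
                                                ! (and registered) outside F90
  contains
    subroutine get_coarray(p)
      real, pointer :: p(:,:)
      if (.not. allocated(storage)) allocate(storage(20,20))
      p => storage            ! initialize the pointer's dope vector
    end subroutine get_coarray
  end module fake_caf_runtime

  program generated_code_sketch
    use fake_caf_runtime
    real, pointer :: y(:,:)   ! what the compiler might emit for: real y(20,20)[*]
    call get_coarray(y)
    y(1,1) = 1.0              ! local co-array access becomes an ordinary store
    print *, y(1,1)
  end program generated_code_sketch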

CAF Compiler Status
- Near production-quality F90 front end from Open64
- Working prototype for a CAF subset
  - allocate co-arrays using a static constructor-like strategy
  - co-array access
    - remote data access uses ARMCI get/put (see the sketch below)
    - process-local data access uses load/store
  - sync_all, sync_team synchronization
  - multi-dimensional array section operations
- Successfully compiled and executed NAS MG
  - platforms: SGI Origin, IA64/Myrinet
  - performance similar to hand-coded MPI
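
How a remote access might reach ARMCI, sketched as commented pseudo-generated code; caf_get_strided and y_handle are invented names for this sketch, while ARMCI_Get and ARMCI_GetS are real ARMCI entry points (note ARMCI numbers processes from 0, CAF numbers images from 1):

  ! CAF source: pull row r of image p's y into a local buffer
  !   buf(1:20) = y(r, 1:20)[p]
  !
  ! the compiler could emit a call to a runtime wrapper such as
  !   call caf_get_strided(y_handle, p, lo, hi, stride, buf)
  ! and the wrapper would issue the transfer with ARMCI:
  !   contiguous data -> ARMCI_Get(src, dst, nbytes, p-1)
  !   strided section -> ARMCI_GetS(src, src_strides, dst, dst_strides,
  !                                 counts, stride_levels, p-1)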

NAS MG Efficiency (Class C)
[Figure: parallel efficiency chart for NAS MG, Class C, on IA64/Myrinet 2000.]

CAF Compiler Coming Attractions
- Co-arrays as procedure arguments
- Triplet notation for co-dimensions
- Co-arrays of user-defined types
  - types can contain pointers
- Dynamic allocation of co-arrays
- Compiler support for parallel I/O
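
In source form, several of these planned features would look roughly as follows; a hedged sketch following the Numrich & Reid proposal (the exact syntax accepted by the Rice compiler may differ):

  ! co-arrays as procedure arguments
  subroutine smooth(u)
    real :: u(:,:)[*]             ! co-array dummy argument
  end subroutine smooth

  ! co-arrays of user-defined types with pointer components
  ! (supports asymmetric, per-image data sizes)
  type mesh
    real, pointer :: cells(:)
  end type mesh
  type(mesh) :: m[*]

  ! dynamic allocation of co-arrays (collective over all images)
  real, allocatable :: a(:)[:]
  allocate(a(1000)[*])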

CAF Language Research Issues
- Synchronization
  - locks instead of critical sections
  - split-phase primitives
  - sync_team/sync_all semantics can require pairwise notification
  - may need synchronization-matching hints to enable optimization
- Language support for efficient reductions
  - manually-coded reductions unlikely to yield portable performance (see the sketch below)
- Memory consistency model for co-array data
- Controlling process-to-processor mapping
- Support for hierarchical locality domains
  - support work sharing on SMPs?
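
For instance, a sum reduction must currently be hand-coded, and any one coding fixes a communication structure. This hedged sketch of a naive all-to-one version serializes O(P) transfers on image 1; a tree-shaped version would behave differently on different platforms, which is exactly the portability problem:

  real :: s[*], total
  integer :: i

  s = real(this_image())          ! stand-in for a real per-image contribution
  call sync_all()                 ! all contributions written
  if (this_image() == 1) then
    total = 0.0
    do i = 1, num_images()
      total = total + s[i]        ! serial gather: P remote reads on image 1
    end do
  end if                          ! (broadcast of total omitted)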

CAF Compiler Research Issues
- Aim for performance transparency
- Compiler optimization of communication and I/O
  - multi-mode communication: direct load/store + RDMA
  - combine synchronization with communication: put/get with flag
  - one-sided → two-sided communication
  - transform from get to put communication (before/after sketch below)
  - exploit split-phase communication and synchronization
  - communication vectorization (before/after sketch below)
  - latency hiding for communication and parallel I/O
  - platform-tailored optimization
  - synchronization strength reduction
- Interoperability with other parallel programming models
- Optimizations to improve node performance
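
Two of these transformations are easy to picture at the source level; a hedged sketch (the compiler works on its intermediate representation, not on source like this, and n, p, q are assumed to be set, with n <= 100):

  real :: x(100)[*], y(100)[*]
  integer :: j, n, p, q

  ! communication vectorization: replace many fine-grained gets
  ! inside a loop with one bulk section transfer
  do j = 1, n
    x(j) = y(j)[p]          ! before: n small one-sided gets
  end do
  x(1:n) = y(1:n)[p]        ! after: a single vectorized get

  ! get -> put: have the producer push data it has just computed
  ! instead of the consumer pulling it later, exposing overlap
  x(1:n) = y(1:n)[p]        ! before, on the consumer
  x(1:n)[q] = y(1:n)        ! after, on producer p (q = consumer's image)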

CAF Interactions
- Working with CAF code from Numrich and Wallcraft (NRL)
- Refining ARMCI synchronization with Nieplocha
- Designing parallel I/O support for CAF with UIUC
- Exploring language design with Numrich and Nieplocha
- Coordinating with Rasmussen (LANL) on a Fortran 90 array dope vector interface library
- Planning a fall CAF workshop at PSC
  - coordinating with Ralph Roskies, Sergiu Sanielevici
  - encouragement from Rich Hirsch, Fred Johnson